In long-term ecological studies, it is important to quantify changes in biodiversity and in the ecosystem as a whole. The scientific community demands large-scale temporal and spatial studies to understand natural and anthropogenically induced population dynamics. In addition, recent anuran population declines around the world have motivated studies to understand the phenomenon [diechman]. One common way to obtain this information is to assess frog communities, since frogs are considered accurate indicators of environmental stress owing to their combined aquatic and terrestrial habitats [Simon2011; Relyea2005]. Researchers have been recording anuran audio signals in a labor-intensive task that generates an ever-increasing amount of audio data, using hand-held microphones and networks of automated programmable recorders that stay in the field for months at a time [Farina2017]. Manual analysis of hundreds of hours of audio is impractical and involves a long and tedious process [acevedo_using_2006], which in recent years has been aided by modern software that relies on Machine Learning (ML) algorithms [diechman]. Many important efforts have been made lately to provide a one-size-fits-all solution; however, the evidence suggests that site- and taxa-specific algorithms are required to obtain the high levels of accuracy and reliability in automatic animal recognition systems necessary to extract ecological information [XIE2016627; Towsey2012; Potamitis2014c; Ulloa2016; Bardeli2010]. Many frog-call recognition approaches have been proposed in the literature, yet their suitability for the analysis of long audio samples recorded in the wild, in places with high biological diversity, using legacy recording equipment and without a systematic approach, remains unclear. Years of frog-call data collection with variable-quality audio samples remain unidentified and archived in audio repositories.
Hence the need for ML assistance in site-specific frog species presence-absence estimation based on frog-call detection in long audio recordings.
Male frogs use acoustic signaling for advertisement purposes: to attract potential mates, defend their territory, and show distress [duellman_biology_1994]. Anuran vocalizations are commonly composed of a call formed by one or many sequenced notes, also known as syllables. A syllable is an acoustic signal produced by air blown through the vocal cords and resonated by a vocal sac [duellman_biology_1994, Chapter 4]. A single call is chosen as the basic element for frog species detection since it exhibits a heterospecific nature, i.e., its structure differs among species.
Several studies in the literature focus on automatic recognition of frog species. For instance, Brandes [brandes_feature_2008] introduced feature vectors extracted from spectrograms and modeled calls of frogs recorded in the Amazon basin with hidden Markov models (HMM). Huang et al. [huang_frog_2009] developed a frog sound identification system that extracts features from previously segmented frog-call syllables, reporting a recognition rate of up to using support vector machine (SVM) classification; their dataset consisted of species, 2 of which were clearly misclassified and required further analysis. Lee et al. [lee_automatic_2006] proposed a method using averaged Mel-frequency cepstral coefficients (MFCC) and linear discriminant analysis (LDA) to automatically identify types of frogs. Chen et al. [chen_automatic_2012] suggested a method based on pre-classification of syllable lengths and a multi-stage averaged spectrum (MSAS) with template matching; this approach reported the best recognition rate on a dataset of 18 frog calls when compared to methods based on dynamic time warping (DTW), k-nearest neighbors (kNN), and SVM. Bedoya et al. [bedoya_automatic_2014] suggested an unsupervised methodology for automatic identification based on a fuzzy classifier and MFCCs; the method was tested successfully with species of anurans found in Colombia. Aboudan et al. [aboudan_acoustic_2013] tested the ability of MFCC and linear predictive cepstral coefficients (LPCC) in the frog recognition process using Gaussian mixture models (GMM), but no real-world recordings containing frog calls were studied. Recently, an end-to-end deep learning approach using convolutional neural networks (CNN) to classify spectrograms has been tried, exhibiting 77% classification accuracy; this shows a limitation of that approach when little training data is available, which is normally the case with new species in the field. Xie et al. [XIE2016627; XIE201713] proposed an intelligent system for estimating frog species richness and abundance that produced important results on long recordings made in Australia, using a combination of acoustic features and random forests. These studies are important contributions to the state of the art. However, none reported applying their algorithms to the analysis of real-world, noise-contaminated audio recordings from an environment such as the Amazonian rainforest of Yasuní National Park (YNP) in eastern Ecuador.
MFCCs have been applied in an out-of-the-box fashion, which exhibited limitations in their ability to model animal sounds, as reported in [Towsey2012; cheng_call-independent_2010; fox_call-independent_2008]. This behavior is expected, since the Mel-frequency filter bank used to generate MFCCs was designed around the auditory properties of human hearing [slaney_auditory_1998] and aims to model the human voice.
In this paper, we propose a modification of the Mel-scale filter bank based on the spectral content of frog calls to obtain a modified cepstral feature set (m-FCC), and compare it experimentally to the standard MFCC and PLP feature sets used in speaker recognition. We performed experiments to find the minimum duration of frog calls required to train accurate GMMs, and investigated the model hyperparameters that minimize the error rate on the training-development set. In addition, a one-vs-all Receiver Operating Characteristic (ROC) analysis per class was performed to identify a threshold vector that allows a likelihood-ratio detector to reject sound segments that do not belong to the model set. The threshold is applied to control the desired sensitivity and specificity of detection per class. For testing, we trained 10 GMMs of frog species using the labeled training-development dataset (an on-line demonstration is available at http://puceing.edu.ec:9001/Reconocimiento.aspx), and applied those models to estimate frog species presence-absence in 141 (23.5 hours) 10-minute audio samples from a different distribution, with reduced quality, that were not used for training and validation of the algorithm. Performance evaluation on the practical presence-absence task validates the proposed approach under real-world conditions and proves the utility of the algorithm when unidentified acoustic data require analysis.
This paper is organized as follows. Section 2 describes the study site, the recording protocol used to register the frog calls in the wild, and the acoustic characteristics of the species studied. Section 3 details the procedure followed for selection and annotation of the ground-truth dataset used in the experiments; front-end segmentation is described and the modified cepstral filter bank is presented. Section 4 is divided into two parts: the first explains the experimental design and the results of the parameter investigation, and the second describes the testing phase on real audio samples made by researchers in the wild. Finally, a discussion is presented in Section 5, and conclusions are summarized in Section 6.
2.1 Study Site
Frog calls were recorded within Yasuní National Park, which is located in the central eastern sector of the Ecuadorian Amazon region (S, W), in the provinces of Orellana and Pastaza (Figure 1). The park is primarily rainforest that lies within the Napo moist forests ecoregion and is considered one of the most biodiverse places on earth [bass_global_2010]. It was designated a UNESCO Biosphere Reserve in 1989. Its climate is characterized by warm temperatures averaging to for all months; rainfall is high, approximately throughout the year. Relative humidity in YNP is between 80% and 94%. The average elevation of the park is low, from approximately to above sea level; the territory is frequently crossed by hills of to high. The soil is mostly geologically young, a product of fluvial sediments from the erosion of the Andes mountains [ministerio_del_ambiente_plan_2011].
2.2 Acoustic Environment
The acoustic environment of the Amazon basin is known to present a challenge for signal processing algorithms in the automatic analysis of recordings [sueur2012global]. This region has a tropical rainforest soundscape with high sound diversity [riede_monitoring_1993]. In this paper we focus on frog calls, and any other sound source is considered noise. Previous studies [sueur2012global] have identified three main types of noise when recording soundscapes: biotic noise, anthropogenic noise, and environmental noise. A combination of these types of noise is present in the dataset used for this study. Some recordings contain anthropogenic noise such as human voice and the 60 Hz "humming" of a nearby electric generator, while others contain biotic noise from insects, mammals, and nearby species. In addition, we identified broad-band transient noise resulting from friction of the microphone boom with the surrounding vegetation and from water drops falling on the microphone while recording. Figure 2 shows the typical pond soundscape found within YNP, in which a chorus of Rhinella margaritifera can be observed amidst anthropogenic and biotic noise.
2.3 Acoustic Recording Protocol
The audio database containing the frog calls used in this study was provided by the Museo de Zoología (QCAZ) of the Pontificia Universidad Católica del Ecuador (PUCE) [ron_amphibiawebecuador._2016]. The material was unlabeled, and a few files contained only voice annotations made in the field. For the training and validation experiments, we used recordings made with a Sennheiser K6-ME67 unidirectional microphone attached to Olympus LS-10 or Marantz PMD660 digital recorders with sampling frequencies of and at 16-bit resolution. Sound was archived in lossless WAV files in order to preserve the integrity of the audio. Recording sessions ran from 7:00 p.m. to 2:00 a.m. at natural ponds and trails located within the YNP, carried out by different researchers between 2003 and 2013. Since locating the exact position of frogs calling in the wild at night is difficult, frog calls were registered by aiming the microphone at the zone where the frogs were heard calling. The distance to the frog is therefore not available and varies within the dataset from a few meters to tens of meters for loud species. Considering that the distance to the frog is uncertain, we focused on the SNR when evaluating the detectability of a frog call in the sound file [estrella_selection_2017].
The audio used for testing was recorded at two ponds located close to PUCE's Yasuní research station by placing an omni-directional microphone 1.5 meters above the surface, attached to a cassette recorder, on a daily schedule during February (8 days), April (17 days), July (12 days), August (16 days), and September (13 days) of 2001. Pond 1 was recorded from 8:50 p.m. to 9:00 p.m., and Pond 2 from 1:50 a.m. to 2:00 a.m. The recordings were performed prior to a Visual Encounter Survey [Padilla:Thesis:2005]. The analog audio was transferred to digital WAV format at a 48 kHz sampling rate using a USB digital audio converter in 2012.
2.4 Study Species
For our experiments we selected the frog species listed in Table 1, chosen based on availability at the time of labeling. Although more than 130 frog species have been identified so far in the study zone, typically only a few species are active at the same time and place. This is an important constraint for an automated analysis method, since only a small subset of acoustic models is required for classification given a geographic location and timespan. To account for calls or sounds that are not modeled by the system, an unknown-sound option is included, and its output can be studied by a specialist if necessary. Acoustic power in the frog calls of Table 1 is mostly distributed in the range 430 to 7500 Hz, depending on the species. Spectrograms of calls for each species, computed with a 1024-sample Blackman-Harris window and 50% overlap, are shown in Figure 3.
|Species|# of Calls|Seconds|Freq. range in Hz|
|Boana alfaroi|98|44.7|1660 - 3100|
|Dendropsophus bifurcus|103|53.1|2300 - 3390|
|Boana cinerascens|169|105.5|1300 - 1530|
|Pristimantis conspicillatus|186|78|1630 - 3900|
|Leptodactylus discodactylus|330|146.3|1680 - 3260|
|Osteocephalus fuscifacies|50|23.52|1000 - 2500|
|Boana lanciformis|97|57.7|500 - 2720|
|Rhinella margaritifera|108|124.2|800 - 1700|
|Dendropsophus parviceps|38|40.1|5660 - 7500|
|Engystomops petersi|316|137.2|430 - 3140|
3.1 Frog-call Dataset
We generated a ground-truth corpus of frog calls for training and testing ML algorithms [estrella_selection_2016; dataset]. From the unlabeled audio provided by the QCAZ museum, we manually selected audio files containing frog calls. Nine species had enough acoustic material, from 40 to 146 seconds of calls, to allow the creation of the training-validation subsets used in the first set of experiments. Since the classes were unbalanced, the split was made by selecting 6, 12, and 18 seconds of calls for training, with the rest used for validation. The tenth species was also included to generate a model for long-recording analysis during testing. Most frog calls were chosen with SNR higher than 3 dB, but we also included calls with background noise and some interference to study the performance of the algorithm under the noisy conditions that occur in the study zone. Field recordings containing human voice, mechanical artifacts, or inter-specific overlapping calls were used neither to create the training-validation dataset nor to train the final GMMs. Table 1 presents the number of calls and seconds of audio available per species in the labeled dataset.
Labeling of frog calls was aided by a short-time energy (STE) based automatic segmentation algorithm described in Section 3.2. Automatic segmentation of frog calls into syllables has been attempted previously by Jaafar et al. [jaafar_automatic_2013] with interesting results. The front-end STE segmentation algorithm outputs the start and end points of a segment containing frog calls within the selected portion of audio. Each segment was manually labeled according to the species it belonged to, by placing cue points marking the start and end of the section containing calls. Automatic segmentation was preferred because an early attempt at manual endpoint selection resulted in a lack of consistency among different annotators.
For testing the algorithm under real-world conditions, a subset of 18 of the 141 unidentified audio samples was manually labeled by AET using headphones and spectrogram visualization. A vector of 10 binary variables (one per species, in the order of Table 1) was assigned per sample, in which one denotes presence and zero absence. For instance, [0 1 0 0 0 0 0 0 0 0] represents presence of D. bifurcus and absence of all the others.
3.2 Frog Call Segmentation
Since the frog call was chosen as the basic element of species identification, front-end processing required a segmentation technique that detects calls while avoiding portions of silence and noise. We adapted a classic voice-analysis silence-removal method [rabiner_algorithm_1975] based on band-pass filtering, STE estimation, and thresholding. Figure 4 shows the algorithm pipeline.
First, the whole audio sample was divided into consecutive 30-second frames. A band-pass finite impulse response (FIR) filter was applied to the original audio signal. The cut-off frequencies are user defined and were chosen based on the frequency range spanning most of the call energy of the objective species. Table 1
shows the frequency ranges of the filters used to generate the training set. The boundaries were calculated as the points where the spectral power is -20 dB relative to the point of maximum power of the frog call. Note that the filter is applied only prior to segmentation; the original unfiltered audio is used for feature extraction and classification. Second, an STE sequence is generated from consecutive non-overlapping frames of the filtered signal according to Equation 1.
$E_m = \sum_{n=(m-1)N+1}^{mN} x^2[n] \qquad (1)$

where $E_m$ is the energy of frame $m$, $x[n]$ is the filtered discrete-time signal, and $N$ is the number of samples in each frame.
A moving-average (MA) filter was applied to the STE sequence to smooth transients and delimit the STE of the whole frog call (or of consecutive calls) rather than of each separate note. The value MA = 12 was chosen empirically, since it is related to the minimum frog-call duration that will be segmented.
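The energy front end described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `short_time_energy` and `smooth` are hypothetical names, the frame length is left as a parameter, and the moving-average width defaults to the empirically chosen MA = 12.

```python
import numpy as np

def short_time_energy(x, frame_len):
    """Energy of consecutive non-overlapping frames (Equation 1)."""
    n_frames = len(x) // frame_len
    frames = np.reshape(x[:n_frames * frame_len], (n_frames, frame_len))
    return np.sum(frames ** 2, axis=1)

def smooth(ste, width=12):
    """Moving-average filter that merges the notes of a call into one lobe."""
    kernel = np.ones(width) / width
    return np.convolve(ste, kernel, mode="same")
```

In use, the audio would first be band-pass filtered to the target species' range, then passed through `short_time_energy` and `smooth` before thresholding.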
3.3 Endpoint Detection
The smoothed STE sequence was then transformed to a logarithmic (dB) scale, and the following routine was applied to estimate the start and end points of the frog calls in the frame.
Define a threshold value scaled by a constant C that is determined empirically.
If several consecutive values of the smoothed sequence are above the threshold, set a start-point. Subsequently, if several consecutive values fall below the threshold, set the end-point.
The threshold allows fine-tuning of the sensitivity of the endpoint detection; its value is related to the SNR of the segmented audio that undergoes classification. Figure 5 shows calls of Dendropsophus bifurcus detected by applying the segmentation algorithm to a field recording.
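The endpoint routine can be sketched as a single scan over the smoothed STE sequence. The helper name `detect_endpoints` and the number of consecutive frames (`run`) needed to confirm a start or end point are assumptions; the paper leaves that count unspecified.

```python
def detect_endpoints(ste_db, threshold, run=3):
    """Return (start, end) frame indices of detected calls.

    A start-point is set after `run` consecutive values above the
    threshold; an end-point after `run` consecutive values below it.
    """
    segments, above, below, start = [], 0, 0, None
    for i, e in enumerate(ste_db):
        if e > threshold:
            above, below = above + 1, 0
            if start is None and above >= run:
                start = i - run + 1          # back up to the first high frame
        else:
            below, above = below + 1, 0
            if start is not None and below >= run:
                segments.append((start, i - run + 1))
                start = None
    if start is not None:                    # call still active at end of frame
        segments.append((start, len(ste_db)))
    return segments
```

Raising `run` makes the detector ignore short transients at the cost of missing very brief calls, which mirrors the sensitivity trade-off discussed above.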
3.4 Acoustic features extraction
Mel-frequency cepstral coefficients (MFCC) [mermelstein1976distance] and perceptual linear predictive (PLP) analysis [Hermansky1992] have been the dominant feature sets in automatic speech recognition systems [hermansky2013], as well as in automatic recognition of animal sounds, with interesting results in birds [fox_call-independent_2008; cheng_call-independent_2010], odontocetes [roch_gaussian_2007], and anurans [bedoya_automatic_2014]. Those feature sets are optimized for human voice processing and have mostly been applied without modification to the problem of animal sound recognition, obtaining important results. However, close observation of the spectral energy of frog calls reveals a distribution different from that of human voice. Therefore, it is not optimal to apply standard MFCC or PLP features without some modification to capture the spectral characteristics that differentiate frog calls. We propose hand-crafted cepstral coefficients with a modified filter-bank distribution following the layout shown in Figure 6 for frog-call recognition, and compare their performance with standard MFCC and PLP-RASTA features. The procedure to extract the cepstral feature set is summarized as follows:
Fragments of sound containing frog calls resulting from the segmentation step described in Section 3.2 were divided into 20 ms frames with 75% overlap. Each frame was then pre-emphasized with a first-order high-pass filter of the standard form $H(z) = 1 - \alpha z^{-1}$ (a pre-emphasis coefficient $\alpha$ close to 0.97 is typical) and Hamming-windowed to minimize discontinuities at the edges. The discrete Fourier transform (DFT) was taken and the triangular-shaped 40-element filter bank of Figure 6(b) was applied.
The log-energy of each filter was obtained, and the discrete cosine transform (DCT) of the resulting vector of log-energies was calculated to decorrelate the energies. Finally, the first 20 elements of the resulting vector were concatenated per frame, and the resulting matrix was used as the feature set for the classification step. A detailed description of Mel-cepstrum computation can be found in [cheng_call-independent_2010; bedoya_automatic_2014; mermelstein1976distance].
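The pipeline above (framing, pre-emphasis, Hamming window, DFT, triangular filter bank, log, DCT) can be sketched as follows. This is only an illustration of the generic cepstral recipe: the actual m-FCC layout is the one in Figure 6, whereas here a linear filter spacing over an assumed band (300 Hz to Nyquist) is used, and all function names are hypothetical.

```python
import numpy as np

def triangular_filterbank(n_filters, n_fft, sr, fmin, fmax):
    """Triangular filters spaced linearly between fmin and fmax (an
    assumed spacing; the paper's m-FCC layout differs)."""
    edges = np.linspace(fmin, fmax, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def cepstral_features(x, sr, n_filters=40, n_ceps=20, alpha=0.97):
    frame_len = int(0.020 * sr)                  # 20 ms frames
    hop = frame_len // 4                         # 75 % overlap
    fb = triangular_filterbank(n_filters, frame_len, sr, 300.0, sr / 2)
    window = np.hamming(frame_len)
    n = np.arange(n_filters)
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len].astype(float)
        frame[1:] -= alpha * frame[:-1]          # pre-emphasis
        spec = np.abs(np.fft.rfft(frame * window)) ** 2
        logE = np.log(fb @ spec + 1e-12)         # log filter-bank energies
        # DCT-II of the log energies, keeping the first n_ceps terms
        ceps = [np.sum(logE * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                for k in range(n_ceps)]
        feats.append(ceps)
    return np.array(feats)
```

Swapping `triangular_filterbank` for one built from the Figure 6 center frequencies would turn this generic recipe into the m-FCC variant.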
3.5 Frog species model representation
Gaussian mixture models have been shown to accurately model speaker identities when short utterances of unconstrained speech are available for classification [reynolds_robust_1995]. In the case of animal sound recognition, GMMs have been applied to identify vocalizations of individual birds [cheng_call-independent_2010] and marine mammals [roch_gaussian_2007]. Since our aim is to identify the species of frogs calling in audio samples, we modeled each species with a multivariate GMM. Formally,
A Gaussian mixture density is defined according to:

$p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i\, b_i(\mathbf{x})$

where $\mathbf{x}$ is a $D$-dimensional feature vector, $b_i(\mathbf{x})$ are the component densities, and $w_i$ are the mixture weights that satisfy the constraint $\sum_{i=1}^{M} w_i = 1$ [reynolds_robust_1995]. Each component density is a Gaussian function of $D$ variables:

$b_i(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu}_i)^{\top} \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) \right\}$

with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\Sigma_i$. Each GMM is denoted by its mean vectors, covariance matrices, and mixture weights according to:

$\lambda = \{ w_i, \boldsymbol{\mu}_i, \Sigma_i \}, \quad i = 1, \dots, M$
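For the diagonal-covariance case used in this work, the mixture density can be evaluated directly; `gmm_log_density` is an illustrative helper, and the log-sum-exp trick is used for numerical stability.

```python
import math

def gmm_log_density(x, weights, means, variances):
    """Log of the diagonal-covariance mixture density p(x | lambda).

    weights, means, variances each have one entry per component;
    means and variances are D-dimensional sequences.
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # log of a D-variate Gaussian with diagonal covariance
        log_b = -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                           for xi, m, v in zip(x, mu, var))
        log_terms.append(math.log(w) + log_b)
    mx = max(log_terms)                      # log-sum-exp
    return mx + math.log(sum(math.exp(t - mx) for t in log_terms))
```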
We need a model for each species available in the labeled dataset of Table 1. Thus, we generated a set of training vectors for every utterance in the dataset by extracting cepstral features at fixed time steps. The resulting matrix was then used to train models $\lambda_1, \lambda_2, \dots, \lambda_S$, one per species, by maximum likelihood estimation (MLE) using the expectation-maximization (EM) algorithm described in [dempster1977maximum; reynolds_robust_1995; cheng_call-independent_2010]. Each model was initialized by setting the mean values with the k-means++ algorithm [arthur2007k]; the initial covariance matrices were set as diagonal, with elements given by the variances of the training features, and the initial mixing proportions were set as uniform.
3.6 Frog species identification
The output of front-end segmentation yields a sequence of feature vectors $X = \{\mathbf{x}_1, \dots, \mathbf{x}_T\}$ that contains characteristics of the sound source that generated it. We need to find the species model with the maximum a posteriori probability for $X$. Formally,

$\hat{S} = \arg\max_{1 \le k \le S} \Pr(\lambda_k \mid X) = \arg\max_{1 \le k \le S} \frac{p(X \mid \lambda_k)\,\Pr(\lambda_k)}{p(X)}$

where $\hat{S}$ is the hypothesized frog species and $S$ is the number of models. Even though prior probabilities could be defined based on the geographic location of the study, we assume identical prior probabilities of frog species, $\Pr(\lambda_k) = 1/S$, and remove $p(X)$ since it is the same for all models. The decision rule becomes:

$\hat{S} = \arg\max_{1 \le k \le S} p(X \mid \lambda_k)$

Since front-end processing produces variable-size audio segments, we need to normalize for size $T$. Applying logarithms and assuming independence between observations, the species identification stage calculates:

$\hat{S} = \arg\max_{1 \le k \le S} \frac{1}{T} \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_k)$
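The length-normalized decision rule can be sketched as follows; `identify_species` and the pluggable `log_density` callback are illustrative names, with `log_density(x, model)` standing in for the per-frame GMM log-likelihood.

```python
def identify_species(segment_frames, models, log_density):
    """Index of the model maximizing (1/T) * sum_t log p(x_t | lambda_k),
    assuming equal priors across species."""
    scores = []
    for lam in models:
        ll = sum(log_density(x, lam) for x in segment_frames)
        scores.append(ll / len(segment_frames))   # normalize for segment size T
    best = max(range(len(models)), key=lambda k: scores[k])
    return best, scores
```

Returning the full score vector, not just the winner, is deliberate: the verification stage below needs the scores of the alternative models as well.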
3.7 Species detection
With the hypothesized frog species $\hat{S}$, we need to determine whether the sound segment was actually produced by $\hat{S}$. To accomplish this goal, we applied the log-likelihood ratio statistic defined by:

$\Lambda(X) = \log p(X \mid \lambda_{\hat{S}}) - \log p(X \mid \lambda_{\overline{S}})$

where $\Lambda(X)$ is a score that indicates how likely it is that a given call segment belongs to the hypothesized species model $\lambda_{\hat{S}}$, and $p(X \mid \lambda_{\overline{S}})$ is the probability density function of the set of alternative species in the model set. This task is known as verification in the speaker detection literature [Reynolds2000]. Although more than one species usually calls at the same time, for the application of presence-absence estimation we focused on single-species detection per segment. Close observation of the calling patterns of frogs during reproduction reveals that frogs call repeatedly in choruses, sharing the available time and frequency resources. For a 10-minute sound sample, we assumed that at least one segment of sound contains a single species. To represent the pdf of the alternative species we applied the median over the set of non-hypothesized models:

$\log p(X \mid \lambda_{\overline{S}}) = \underset{k \ne \hat{S}}{\mathrm{median}}\, \log p(X \mid \lambda_k)$
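Given the per-model segment scores, the ratio against the median of the alternatives reduces to a few lines; `likelihood_ratio_score` is an illustrative name.

```python
import statistics

def likelihood_ratio_score(seg_scores, hyp):
    """Lambda(X): score of the hypothesized model minus the median score
    of the alternative (non-hypothesized) models in the set.

    seg_scores[k] is the length-normalized log-likelihood of the
    segment under model k; hyp is the index of the hypothesized species.
    """
    alternatives = [s for k, s in enumerate(seg_scores) if k != hyp]
    return seg_scores[hyp] - statistics.median(alternatives)
```

The median is less sensitive than the mean to one alternative model that happens to fit the segment unusually badly, which keeps the background score stable.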
To find a threshold on the score that permits detection of frog calls with high likelihood while rejecting non-relevant sound segments (overlapped calls, noise, non-modeled species, etc.), we defined threshold values per class by applying one-vs-all receiver operating characteristic (ROC) analysis.
3.8 Threshold vector selection
ROC curves [hanley1982meaning; Fawcett2005] for the multi-class detection problem are shown in Figure 7. A one-vs-all approach was used to calculate the true positive rate (TPR) versus the false positive rate (FPR) for each class as a function of a varying log-likelihood-ratio score. For an acoustic event to count as a class detection, its score must be above the threshold defined for that class. Higher threshold values yield more selective detection, while lower values increase sensitivity. A trade-off between false detections and false rejections exists, and the operating point depends on the application. For frog-call presence-absence estimation in long recordings, it is desirable to minimize false alarms while keeping the TPR as high as possible. Since the audio recorded in YNP was registered with a single microphone at a fixed site, it is important to detect as many frog calls as possible within the range of the microphone. Therefore, the operating point requires a high TPR while keeping the FPR reasonably low. Some intuition about the calling behavior and calling frequency of each species is desirable when setting the threshold. We began with a fixed threshold vector of operating points corresponding to an FPR of 5% for all species and fine-tuned the thresholds by comparing the analysis output to the manually annotated corpus. The final threshold vector used in the analysis of Section 4.3 was [5 3 6 9 6.75 5.25 5.5 11 6 6], and the corresponding operating points are shown in Figure 7.
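The one-vs-all ROC points used to pick each class threshold can be sketched as follows; `roc_points` is an illustrative helper that sweeps a list of candidate thresholds for a single class.

```python
def roc_points(scores, labels, thresholds):
    """One-vs-all ROC: list of (FPR, TPR) pairs, one per threshold.

    labels are 1 for events of the class under study and 0 otherwise;
    an event counts as a detection when its likelihood-ratio score
    exceeds the threshold.
    """
    P = sum(labels)            # positives of this class
    N = len(labels) - P        # everything else
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        points.append((fp / N, tp / P))
    return points
```

Running this once per species and picking, for each, the lowest threshold whose FPR stays near the target (5% here) reproduces the per-class threshold-vector procedure described above.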
4 Experimental Results
We divided this section into two parts. First, a set of experiments was designed to identify the hyperparameter values that perform best at recognizing frog species from a frog call in the development dataset, as well as the minimum time required to train accurate GMMs. Second, the parameters obtained in the first part were used to train production GMMs using all the material available in the dataset. Automatic analysis of real field recordings coming from a different distribution was performed, and its results were compared to human-level performance on the presence-absence task.
4.1 Parameter investigation
The number of mixture components needed to model the frog species adequately, and the minimum training time required, were determined by the following experiment. Nine frog models with 2, 4, 8, 16, 32, and 64 Gaussian components and diagonal covariance matrices were trained using 6, 12, and 18 seconds of frog calls, corresponding to 600, 1200, and 1800 12-dimensional mel-cepstral feature vectors. The dataset was divided into training sets of 6, 12, and 18 seconds, and the remaining calls were used for testing. We applied 10-fold cross-validation with random segment selection over the whole dataset for each species to model the distribution of the weighted error rate (WER). The WER was calculated as the average of the individual per-species Bayesian error rates of the nine species, to account for the unbalanced classes. Figure 8 shows the distribution of WER for the different training times and numbers of mixtures M under the 10-fold cross-validation procedure.
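The WER, taken here as the plain average of per-species error rates so that unbalanced class sizes do not bias the figure, can be sketched from a confusion matrix; the matrix layout (`confusion[i][j]` = calls of species i classified as species j) is an assumption for illustration.

```python
def weighted_error_rate(confusion):
    """Average of per-species error rates from a square confusion matrix.

    Averaging the per-species rates (rather than pooling all calls)
    keeps frequent species from dominating the summary figure.
    """
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append((total - row[i]) / total)   # misclassified fraction
    return sum(rates) / len(rates)
```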
The following observations can be made from the results. For the three training times tested, identification performance increased from 2 to 16 mixture components, leveling off at 32 and 64 components. However, when training with 6 seconds, identification performance degraded with increasing model order, suggesting that at least 12 seconds of training data and more than 32 mixture components are required to model the frog species adequately. This result is important since it provides a guideline for the minimum amount of training data required when training frog-species GMMs.
|Species|M = 2|M = 4|M = 8|M = 16|M = 32|M = 64|
Finally, Table 2 presents the error rate per species according to the number of Gaussian mixture components used to model each species, with 12 seconds of training time and 20 m-FCC features. The variable error rates across distinct M suggest that the optimal model order is not the highest for every species in the dataset. A different model order could be chosen for each species to avoid over-fitting, as suggested by Cheng [cheng_call-independent_2010]. However, we selected model order M = 64, which gives the minimum error rate on average, since it is not clear how to choose the optimal model order for species that are not included in the training-validation set but could be added in the future.
4.2 Features comparison
To investigate the WER with respect to the feature set used to model the frog calls, GMMs were trained using 12 MFCC, 20 MFCC, 20 PLP-RASTA, and 20 modified-filter-bank cepstral features (m-FCC). The results are shown in Figure 9, which presents the WER with respect to the number of mixture components for each feature set. The standard MFCC used in automatic speaker recognition exhibited the lowest performance, with a slight improvement when the number of features increased from 12 to 20. The PLP-RASTA features outperformed standard MFCC by approximately 1% when more than 16 mixture components were used, suggesting that the Bark filter bank used to calculate the PLP feature set captures the spectral characteristics of the frog calls better than the Mel filter bank. Finally, the classification performance of the 20 cepstral features calculated with the modified filter bank described in Section 3.4 surpassed the others. The results suggest that modifying the filter bank to model the spectral shapes of frog calls, rather than those of human voice, is appropriate.
4.3 Field recordings analysis
Ten GMMs were trained using the whole labeled dataset and applied to analyze 23.5 hours of audio in 50 WAV files, with a focus on presence-absence estimation. Each file contained three 10-minute samples delimited by cue points that inform the algorithm of their positions. The audio contains unidentified calls amidst different types of noise and distortion resulting from volume variation during recording, clipping, a malfunctioning cable, microphone friction, rain, and digitization noise. Scanning each file took approximately 50 seconds on a laptop with a 2.6 GHz processor and 16 GB of RAM. Figure 13 shows a 40-second snippet of the segmentation stage, and classified segments with their respective likelihood-ratio scores are presented in Figure 10.
Presence-absence estimation results for the whole corpus are summarized in Figure 11 for both ponds studied. One binary vector per sample was obtained and plotted per day and month according to the sampling schedule available for the 5-month period. To evaluate performance, the 18 audio samples of February were manually annotated by AET using headphones and spectrogram visualization. The resulting 10-variable binary vectors were compared to the output of the algorithm variable by variable, and binary classification performance metrics were applied. The resulting scores are presented in Table 3.
Calls of B. alfaroi, P. conspicillatus, R. margaritifera, and D. parviceps did not occur in the recordings and were correctly estimated as absent by the machine, with the exception of one sample in July in which a false detection of P. conspicillatus occurred. In contrast, D. bifurcus was detected in all the annotated samples by both human and machine. The algorithm was able to estimate absence correctly for the species that did not call during a sample, while presence of O. fuscifacies and B. lanciformis presented a challenge: those species called only once or twice per sample, making them difficult to detect during manual screening as well as during automatic analysis. Nevertheless, the automatic analysis of February detected presence of O. fuscifacies in a sample where it had not been detected by manual labeling initially. Close inspection of the sample at the position where the algorithm detected the call enabled the researcher to identify and label that sample correctly, in a case where only one call existed.
The call detector performed well in terms of accuracy and precision while maintaining high specificity. In other words, the detector exhibited no false positives in the labeled set (no false species presences) and did not confuse the detected species. On the other hand, a recall of 0.875 means that species present in a recording were sometimes not detected. This behavior was mostly due to species calling only once during the sample time, and it poses a limitation that can be mitigated by increasing the sampling time or by using a species-specific approach with a lower threshold for the desired species. These results show that the proposed learner is able to generalize to real-world audio that was not used for training and validation and was recorded with different equipment.
Overall, more detections occurred in Pond 1 than in Pond 2, suggesting higher acoustic activity in Pond 1 during the study. In Pond 1, the highest number of detections belonged to B. cinerascens, accounting for 58% of detections, followed by D. bifurcus with 23% and L. discodactylus with 17%. The remaining 2% belonged to O. fuscifacies and B. lanciformis. In contrast, Pond 2 showed mostly detections of D. bifurcus and E. petersi.
Figure 11 presents the number of detections of the three species that called most frequently during the February sampling in Pond 1. With longer and planned acoustic samples, a researcher could gain insights into the reproductive activity of those species; for instance, circadian reproductive activity, and probably abundance, might be extracted. However, it is still not clear how to extrapolate from the number of calling males to the actual population, including females and juveniles, using the proposed approach in YNP.
Monitoring animal sound in the tropical rainforest using automatic approaches is known to be a challenging problem because of the high amount of noise present and the variable conditionsriede_monitoring_1993 ; XIE201713 . Many algorithms have been proposed that allow researchers to search audio for calls of birdsUlloa2016 , insectsPotamitis2014c , odontocetesroch_gaussian_2007 , and frogsbedoya_automatic_2014 ; aboudan_acoustic_2013 ; XIE201713 . However, more research is needed to assess their suitability for studying audio recorded in frog communities with high biological diversity such as Yasuní National Park in Ecuador. An automatic approach to estimate presence-absence in long-term audio recorded on site with local taxa is necessary to help researchers understand the dynamics and ecology of local frogs.
In this study, we applied frog call recognition with a verification stage to study real-world audio recordings. Frog call classification has been attempted on selected calls with high SNR in previous studies bedoya_automatic_2014 ; brandes_feature_2008 , which showed that calls with low SNR were misclassified due to interference and noise. We found that it is not necessary to detect every frog call in the audio sample to achieve the goal of 10-species presence-absence estimation: a single detected call is enough to establish species presence, which is highly probable in ponds where frogs call repeatedly to attract mates. Threshold setting then becomes an important step to tune the detector and set a desired operating point that minimizes false alarms. Nonetheless, for frogs that called only once during the sample, a limitation of this approach was observed, which was also the case during human annotation by AET using headphones. For instance, calls of O. fuscifacies and B. lanciformis that occurred once or twice during the sample were not heard and, in consequence, not labeled the first time. Observing the results of the algorithm on those samples, which showed detections of those species, enabled the researcher to go back to the recording and verify that they were indeed present. In that context, the algorithm already helped complement human performance and save labeling time for acoustic samples, an important task when preparing ground truth for developing Machine Learning algorithms.
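The presence rule and threshold tuning described above can be sketched as a minimal function: a species is declared present if at least one detected-and-verified call segment scores above that species' operating-point threshold. The species names, scores, and threshold values below are hypothetical.

```python
import numpy as np

def estimate_presence(segment_scores, thresholds):
    """Declare a species present if any verified call segment scores at or
    above that species' threshold; otherwise estimate it as absent."""
    return {sp: bool(np.any(np.asarray(s) >= thresholds[sp]))
            for sp, s in segment_scores.items()}

# Hypothetical detector scores for one sample with three verified segments.
scores = {"D. bifurcus": [2.1, 3.4, 2.8],       # calls repeatedly: easy case
          "B. lanciformis": [-1.0, 0.6, -0.7]}  # called once, near threshold
thr = {"D. bifurcus": 0.5, "B. lanciformis": 0.5}
print(estimate_presence(scores, thr))
```

Raising the B. lanciformis threshold above 0.6 would miss its single call, which illustrates the trade-off between false alarms and the recall limitation for species that call only once per sample.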
Even though previous studies suggested that MFCC coefficients are not suited for animal call recognitionTowsey2012 ; roch_gaussian_2007 , those studies used MFCCs designed for human voice recognition without modification. We found that a filter-bank modification based on the spectral content of frog calls enabled the resulting cepstral features to improve classification performance using GMMs. This is an important result, since it suggests that hand-crafted cepstral coefficients perform better when focused on the spectral characteristics of the taxa of interest. Despite efforts to develop a one-fits-all system, algorithms that perform well in some situations generally tend to do poorly on other datasets, as stated by the no-free-lunch theorem. We applied our approach to two distinct audio samples recorded at different ponds within YNP with consistent results, suggesting good generalization capability. In addition, the audio used for training the GMMs was captured with digital equipment, whereas the audio used for testing was recorded using an analog cassette recorder with different microphones. This result is important since it suggests the possibility of studying audio coming from different sources and equipment, which is normally the case in the field, where multiple people record calls at different times.
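As a rough illustration of this kind of filter-bank modification, the sketch below builds triangular filters confined to an assumed frog-call band and applies a DCT to the log filter-bank energies. The band limits (500–6000 Hz), linear filter spacing, and filter counts are illustrative assumptions, not the exact design used in this study.

```python
import numpy as np

def frogband_cepstra(power_spectrum, sr, n_fft, n_filters=20, n_ceps=12,
                     f_lo=500.0, f_hi=6000.0):
    """Cepstral coefficients from a triangular filter bank placed over the
    band where the target frog calls concentrate their energy (f_lo..f_hi),
    instead of the full-bandwidth mel scale of standard MFCCs."""
    # Filter edge frequencies mapped to FFT bin indices.
    edges = np.linspace(f_lo, f_hi, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_e = np.log(fbank @ power_spectrum + 1e-10)  # log filter-bank energies
    # DCT-II decorrelates the log energies into cepstral coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return dct @ log_e

# Toy frame: power spectrum of a 512-sample noise frame at 22.05 kHz.
rng = np.random.default_rng(0)
ps = np.abs(np.fft.rfft(rng.standard_normal(512))) ** 2
print(frogband_cepstra(ps, sr=22050, n_fft=512).shape)  # (12,)
```

The resulting 12-dimensional frame vectors would then be modeled per species with GMMs, as in the approach described above.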
Front-end processing was very important for obtaining the results shown in this study. In XIE2016627 , Xie et al. proposed a frog acoustic activity detector in order to focus their classifier only on frog calls. We aimed to keep front-end segmentation as simple as possible and to rely on verification after classification to remove non-frog audio. Verification was able to reject noise coming from a malfunctioning cable, human voice, unknown species, call overlap, cellphone noise, etc. The STE segmentation approach applied proved effective in situations where frogs call intermittently with at least 10 ms of inactivity between calls. A limitation was identified for species like D. bifurcus, which call non-stop in choruses. Since this is a variable-size segment approach focused on a band of interest, it could be improved by applying multi-frequency segmentation as suggested in inproceedings , with classification and verification applied to each resulting sequence and the detected species accumulated. Finally, recent advances in end-to-end convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which process all the audio without prior segmentation, could provide a way to remove front-end processing altogether. Nonetheless, the processing power needed for that approach might be prohibitive on desktop computers and require expensive cloud processing. Research on Deep Neural Network applications is advancing fast, and we expect that approach to yield important results in the future. Therefore, we open the data-set to the research communitydataset and provide a baseline for comparison.
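The STE segmentation step discussed above can be sketched as follows: frames above an energy threshold are merged into variable-length segments, and a segment is closed once the required inactivity gap is seen. The adaptive threshold, frame size, and the omission of band-pass pre-filtering are illustrative assumptions, not the exact front end used in the study.

```python
import numpy as np

def ste_segments(x, sr, frame_ms=10, energy_thresh=None, min_gap_frames=1):
    """Variable-length segmentation by short-time energy (STE): consecutive
    frames above an energy threshold form one segment, closed after
    `min_gap_frames` quiet frames (>= 10 ms of inactivity at the default
    frame size). In practice x would first be band-pass filtered to the
    band of interest; that step is omitted here for brevity."""
    frame = int(sr * frame_ms / 1000)
    n = len(x) // frame
    e = np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    if energy_thresh is None:          # simple adaptive threshold (assumption)
        energy_thresh = e.mean() + e.std()
    segments, start, gap = [], None, 0
    for i, active in enumerate(e > energy_thresh):
        if active:
            start, gap = (i if start is None else start), 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:  # enough inactivity: close the segment
                segments.append((start * frame, (i - gap + 1) * frame))
                start, gap = None, 0
    if start is not None:              # call still active at end of sample
        segments.append((start * frame, n * frame))
    return segments                    # (start_sample, end_sample) pairs

# Toy signal: a 20 ms burst inside 100 ms of silence at sr = 1 kHz.
x = np.zeros(100)
x[20:40] = 1.0
print(ste_segments(x, sr=1000))  # [(20, 40)]
```

For a continuous chorus such as D. bifurcus, the inactivity gap never occurs and the whole sample collapses into one segment, which is the limitation noted above.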
The proposed approach proved to be a helpful tool for estimating presence-absence of frog species in pond recordings made in the wilderness of YNP in Ecuador. Several hundred hours of unidentified acoustic material still exist in the QCAZ-PUCE archive, and automatic analysis could save researchers' time by generating metadata that can be verified in a fraction of the time it takes to listen to each recording one by one. Fast audio appraisal and inventory generation without the need for specialists can be performed at an acceptable level of performance by using metadata from presence-absence estimation. In systematic Ecoacoustic recordings used in wildlife monitoring, summing the results of multiple learners trained with specific taxa cohorts might provide a way to estimate biodiversity and study the composition of the soundscape. However, it is difficult to assess those applications in depth at this stage without more acoustic data available.
Machine Learning aid in the automatic evaluation of frog communities in wildlife recordings introduces a potent technology that is complementary to the survey techniques currently used by researchers in the field. Our team is exploring diversity-index estimation by applying the proposed approach to 24-hour-long recordings made in the Mindo region, in which critically endangered frog species call in a different setting with more silence between calls; this makes it simpler for front-end segmentation to extract calls from the background, thereby simplifying the digital signal processing pipeline.
Finally, automatic analysis of audio recordings of frog communities might be useful for researchers studying environmental change, since frog presence is related to the health of the ecosystem and their disappearance provides clues about contamination or climate change effects that could help in developing sustainable solutions. Important applications in wildlife surveillance are envisioned that could be enhanced by wireless acoustic sensor networks in the wilderness.
The study was supported by Pontificia Universidad Católica del Ecuador research projects 2015, Project L13304: “Diseño de un algoritmo de procesamiento de audio para un sistema que permita la automatización del inventario, caracterización y el monitoreo de poblaciones de las ranas del Ecuador, caso de estudio Parque Nacional Yasuní”. We thank Santiago Ron for providing access to the database of frog calls recorded in Yasuní National Park and for his valuable advice; the staff at Estación Científica Yasuní for their cooperation during the field recordings; Jean Camino, Franco Cisneros and Eduardo Silva for their important contribution to the manual labeling of the frog calls used in training and development; Daniela Pareja and Daniel Rivadeneira for their help during field work recording frog calls on YNP trails; Samael Padilla for kindly providing access to the data gathered during his thesis work; and Paloma Lima for her help in the early stages of the manuscript.
- (1) J. L. Deichmann, O. Acevedo-Charry, L. Barclay, Z. Burivalova, M. Campos-Cerqueira, F. d'Horta, E. T. Game, B. L. Gottesman, P. J. Hart, A. K. Kalan, S. Linke, L. D. Nascimento, B. Pijanowski, E. Staaterman, T. Mitchell Aide, It's time to listen: there is much to be learned from the sounds of tropical ecosystems, Biotropica 50 (5) (2018) 713–718.
- (2) E. Simon, M. Puky, M. Braun, B. Tóthmérész, Frogs and toads as biological indicators in environmental assessment, Nova Science Publishers, Inc., 2011, pp. 141–150.
- (3) R. A. Relyea, The lethal impact of Roundup on aquatic and terrestrial amphibians, Ecological Applications 15 (4) (2005) 1118–1124.
- (4) A. Farina, S. H. Gage, Ecoacoustics: The Ecological Role of Sounds, 2017. doi:10.1002/9781119230724.
- (5) M. A. Acevedo, L. J. Villanueva-Rivera, Using Automated Digital Recording Systems as Effective Tools for the Monitoring of Birds and Amphibians, Wildlife Society Bulletin 34 (1) (2006).
- (6) J. Xie, M. Towsey, J. Zhang, P. Roe, Detecting frog calling activity based on acoustic event detection and multi-label learning, Procedia Computer Science 80 (2016) 627–638, International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA.
- (7) M. Towsey, B. Planitz, A. Nantes, J. Wimmer, P. Roe, A toolbox for animal call recognition, Bioacoustics 21 (2) (2012) 107–125.
- (8) I. Potamitis, Automatic classification of a taxon-rich community recorded in the wild, PLoS ONE 9 (5) (2014) 1–11. doi:10.1371/journal.pone.0096936.
- (9) J. S. Ulloa, A. Gasc, P. Gaucher, T. Aubin, M. Réjou-Méchain, J. Sueur, Screening large audio datasets to determine the time and space distribution of Screaming Piha birds in a tropical forest, Ecological Informatics 31 (2016).
- (10) R. Bardeli, D. Wolff, F. Kurth, M. Koch, K. H. Tauchert, K. H. Frommolt, Detecting bird sounds in a complex acoustic environment and application to bioacoustic monitoring, Pattern Recognition Letters 31 (12) (2010) 1524–1534. doi:10.1016/j.patrec.2009.09.014.
- (11) W. E. Duellman, L. Trueb, Biology of Amphibians, JHU Press, 1994.
- (12) T. S. Brandes, Feature Vector Selection and Use With Hidden Markov Models to Identify Frequency-Modulated Bioacoustic Signals Amidst Noise, IEEE Transactions on Audio, Speech, and Language Processing 16 (6) (2008) 1173–1180. doi:10.1109/TASL.2008.925872.
- (13) C.-J. Huang, Y.-J. Yang, D.-X. Yang, Y.-J. Chen, Frog Classification Using Machine Learning Techniques, Expert Syst. Appl. 36 (2) (2009).
- (14) C.-H. Lee, C.-H. Chou, C.-C. Han, R.-Z. Huang, Automatic recognition of animal vocalizations using averaged MFCC and linear discriminant analysis, Pattern Recognition Letters 27 (2) (2006) 93–101.
- (15) W.-P. Chen, S.-S. Chen, C.-C. Lin, Y.-Z. Chen, W.-C. Lin, Automatic recognition of frog calls using a multi-stage average spectrum, Computers & Mathematics with Applications 64 (5) (2012) 1270–1281.
- (16) C. Bedoya, C. Isaza, J. M. Daza, J. D. López, Automatic recognition of anuran species based on syllable identification, Ecological Informatics 24 (2014) 200–209.
- (17) A. Aboudan, Acoustic Monitoring System for Frog Population Estimation Using In-Situ Progressive Learning, Ph.D. thesis, Colorado State University (2013).
- (18) J. Xie, M. Towsey, M. Zhu, J. Zhang, P. Roe, An intelligent system for estimating frog community calling activity and species richness, Ecological Indicators 82 (2017) 13–22.
- (19) J. Cheng, Y. Sun, L. Ji, A call-independent and automatic acoustic system for the individual recognition of animals: A novel model using four passerines, Pattern Recognition 43 (11) (2010) 3846–3852.
- (20) E. J. S. Fox, J. D. Roberts, M. Bennamoun, Call-Independent Individual Identification in Birds, Bioacoustics 18 (1) (2008) 51–67.
- (21) M. Slaney, Auditory Toolbox, Technical Report #1998-010 (1998).
- (22) M. S. Bass, M. Finer, C. N. Jenkins, H. Kreft, D. F. Cisneros-Heredia, S. F. McCracken, N. C. A. Pitman, P. H. English, K. Swing, G. Villa, A. D. Fiore, C. C. Voigt, T. H. Kunz, Global Conservation Significance of Ecuador's Yasuní National Park, PLOS ONE 5 (1) (2010) e8767.
- (23) Ministerio del Ambiente, Plan de Manejo del Parque Nacional Yasuní, Ministerio del Ambiente del Ecuador, Quito, Ecuador, 2011.
- (24) J. Sueur, A. Gasc, P. Grandcolas, S. Pavoine, Global estimation of animal diversity using automatic acoustic sensors, Sensors for ecology. Paris: CNRS (2012) 99–117.
- (25) K. Riede, Monitoring biodiversity: analysis of Amazonian rainforest sounds, Ambio 22 (8) (1993).
- (26) S. R. Ron, A. Merino-Viteri, D. A. Ortiz, 2018. Anfibios del Ecuador. Version 2018.0. Museo de Zoología, Pontificia Universidad Católica del Ecuador, accessed 3 October 2018.
- (27) A. Estrella, D. A. Nicolalde, C. Escobar, range estimation of frog calls in Ecoacoustics long recordings, Revista Pontificia Universidad Católica del Ecuador 106 (2018) 157–178.
- (28) S. Padilla, The reproductive activity of two anuran communities in Yasuní National Park, Bachelor thesis, Pontifical Catholic University of Ecuador, Quito, Ecuador (2005).
- (29) A. Estrella, D. A. Nicolalde, D. P. Nicolalde, D. Pareja, Annotation and Labeling of Ecuadorian Frog Calls, Revista Pontificia Universidad Católica del Ecuador 102 (2016) 37–54.
- (30) M. d. Z. QCAZ, A. E. Terneux, D. A. Nicolalde, D. P. Nicolalde, Labeled frog-call dataset of Yasuní National Park for training Machine Learning algorithms.
- (31) H. Jaafar, D. A. Ramli, Automatic syllables segmentation for frog identification system, in: 2013 IEEE 9th International Colloquium on Signal Processing and its Applications, 2013, pp. 224–228. doi:10.1109/CSPA.2013.6530046.
- (32) L. R. Rabiner, M. R. Sambur, An algorithm for determining the endpoints of isolated utterances, The Bell System Technical Journal 54 (2) (1975) 297–315. doi:10.1002/j.1538-7305.1975.tb02840.x.
- (33) P. Mermelstein, Distance measures for speech recognition, psychological and instrumental, Pattern Recognition and Artificial Intelligence 116 (1976) 374–388.
- (34) H. Hermansky, N. Morgan, A. Bayya, P. Kohn, RASTA-PLP speech analysis technique, in: Proceedings of the 1992 IEEE International Conference on Acoustics, Speech and Signal Processing - Volume 1, ICASSP'92, IEEE Computer Society, Washington, DC, USA, 1992, pp. 121–124.
- (35) H. Hermansky, J. R. Cohen, R. M. Stern, Perceptual properties of current speech recognition technology, Proceedings of the IEEE 101 (9) (2013) 1968–1985. doi:10.1109/JPROC.2013.2252316.
- (36) M. A. Roch, M. S. Soldevilla, J. C. Burtenshaw, E. E. Henderson, J. A. Hildebrand, Gaussian mixture model classification of odontocetes in the Southern California Bight and the Gulf of California, The Journal of the Acoustical Society of America 121 (3) (2007) 1737–1748.
- (37) D. A. Reynolds, R. C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Transactions on Speech and Audio Processing 3 (1) (1995) 72–83. doi:10.1109/89.365379.
- (38) A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society. Series B (Methodological) 39 (1) (1977) 1–38.
- (39) D. Arthur, S. Vassilvitskii, k-means++: The advantages of careful seeding, in: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
- (40) D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing 10 (1) (2000) 19–41.
- (41) J. A. Hanley, B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143 (1) (1982) 29–36.
- (42) T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874.
- (43) T. Giannakopoulos, A. Pikrakis, S. Theodoridis, A novel efficient approach for audio segmentation, 2009, pp. 1–4. doi:10.1109/ICPR.2008.4761654.