Comparing acoustic representations for deep learning-based classification of underwater acoustic signals: A case study on orca (Orcinus orca) vocalizations
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-12-01 |
| Series: | Ecological Informatics |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S1574954125003061 |
| Summary: | Passive acoustic monitoring of marine mammal vocalizations often relies on automated detectors to process large quantities of data. Many automated systems use spectrograms to represent acoustic information, including those built on deep artificial neural networks (DNNs). Spectrograms transform acoustic time series into the time–frequency domain, highlighting how the distribution of sound energy across frequencies changes over time. Marine mammals often have unique spectral signatures that can be used for detection and species identification. The spectrogram is well suited to many such pattern recognition algorithms, including those developed for computer vision, such as convolutional neural networks. However, while it emphasizes some aspects of the signal, it downplays others, as is true of most other representations of acoustic information. In this study, we compare nine acoustic representations and evaluate how they affect the performance of a DNN in classifying acoustic signals. Specifically, we use a dataset of orca (Orcinus orca) vocalizations to build binary classifiers that attempt to distinguish between orca sounds and typical environmental noise, including other biological sounds. The representations of the non-stationary acoustic time series considered include magnitude, mel, and CQT spectrograms, waveforms, cepstrograms, time and frequency similarity matrices, and evolutionary autocorrelation and autocovariance. A DNN was built for each of these representations individually, and we also built DNNs that combined two representations as inputs. We assess the performance of these representations relative to the commonly used magnitude spectrogram, with the F1 score as the central performance metric. The baseline magnitude spectrogram yielded a median F1 score of 0.82 (over 15 trials); its classification performance was surpassed by the frequency similarity matrix (median: 0.88), the time similarity matrix (0.88), and the mel spectrogram (0.84). DNNs that combined representations achieved higher performance than the respective single representations, with the best model combining the mel spectrogram and the frequency similarity matrix to reach an F1 score of 0.92. For our case study, we recommend a combination of mel spectrograms and frequency similarity matrices for orca detectors focusing on stereotypical tonal calls. In general, we encourage developers working on similar tools to consider testing and combining different acoustic representations for improved classification performance. (An illustrative sketch of two of these representations follows this record.) |
|---|---|
| ISSN: | 1574-9541 |
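
For readers who want to experiment with the kinds of representations compared in the abstract, here is a minimal sketch (not the authors' code) of two of them: a mel spectrogram and a time similarity matrix. The file name `clip.wav` and all parameter values (`n_fft`, `hop_length`, `n_mels`) are illustrative assumptions, not taken from the paper; `librosa` is used for audio loading and mel filtering.

```python
# Illustrative sketch only: computes a mel spectrogram and a
# time-similarity matrix for a mono audio clip. File name and
# STFT/mel parameters are assumptions, not the paper's settings.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=None, mono=True)  # hypothetical file

# Mel spectrogram: STFT power pooled through a mel filter bank,
# then converted to decibels.
S_mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64
)
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)

# Time similarity matrix: pairwise cosine similarity between the
# spectrogram's time frames (columns). A frequency similarity matrix
# would be the analogous construction over the rows (frequency bins).
frames = S_mel_db.T                                    # (n_frames, n_mels)
norms = np.linalg.norm(frames, axis=1, keepdims=True)
unit = frames / np.maximum(norms, 1e-10)               # unit-norm frames
time_sim = unit @ unit.T                               # (n_frames, n_frames)
```

In the spirit of the combined-representation models described in the abstract, such matrices could, for example, be resized and stacked with a spectrogram as separate input channels to a convolutional network; the abstract does not specify the authors' exact combination mechanism.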