Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data
Smart cities deploy various sensors such as microphones and RGB cameras to collect data to improve the safety and comfort of the citizens. As data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual representations for downstream tasks. Focusing on surveillance data, we investigate two common limitations of audio-visual contrastive learning: false negatives and the minimal sufficient information bottleneck. Irregular, yet frequently recurring events can lead to a considerable number of false-negative pairs and disrupt the model’s training. To tackle this challenge, we propose a novel method for generating contrastive pairs based on the distance between embeddings of different modalities, rather than relying solely on temporal cues. The semantically synchronized pairs can then be used to ease the minimal sufficient information bottleneck along with the new loss function for multiple positives. We experimentally validate our approach on real-world data and show how the learnt representations can be used for different downstream tasks, including audio-visual event localization, anomaly detection, and event search. Our approach reaches similar performance as state-of-the-art modality- and task-specific approaches.
Main Authors: Wei-Cheng Wang, Sander De Coninck, Sam Leroux, Pieter Simoens
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-01-01
Series: Frontiers in Robotics and AI
ISSN: 2296-9144
DOI: 10.3389/frobt.2024.1490718
Collection: DOAJ
Subjects: self-supervised learning; surveillance; audio-visual representation learning; contrastive learning; audio-visual event localization; anomaly detection
Online Access: https://www.frontiersin.org/articles/10.3389/frobt.2024.1490718/full
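
The abstract sketches two technical ideas: (1) mining contrastive positives by cross-modal embedding distance instead of temporal co-occurrence alone, so that recurring events are not wrongly pushed apart as false negatives, and (2) a contrastive loss that admits multiple positives per anchor. The paper's exact formulation is not reproduced in this record; the PyTorch sketch below only illustrates the general recipe, and the top-k mining rule, the SupCon-style multi-positive averaging, and the hyperparameters (`k`, `temperature`) are assumptions rather than the authors' method.

```python
# Minimal sketch (not the authors' code) of embedding-distance-based positive
# mining for audio-visual contrastive learning, plus an InfoNCE-style loss
# generalized to multiple positives per anchor.
import torch
import torch.nn.functional as F


def mine_cross_modal_positives(audio_emb, video_emb, k=3):
    """For each audio embedding, mark the temporally aligned video embedding
    AND its k nearest video embeddings (by cosine similarity) as positives,
    rather than relying on temporal synchronization alone."""
    a = F.normalize(audio_emb, dim=1)          # (N, D)
    v = F.normalize(video_emb, dim=1)          # (N, D)
    sim = a @ v.t()                            # cosine similarity, (N, N)
    topk = sim.topk(k, dim=1).indices          # k most similar per anchor
    pos_mask = torch.zeros_like(sim, dtype=torch.bool)
    pos_mask.scatter_(1, topk, True)
    # Always keep the temporally synchronized pair as a positive.
    pos_mask |= torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return pos_mask


def multi_positive_nce(audio_emb, video_emb, pos_mask, temperature=0.1):
    """InfoNCE extended to several positives per anchor (SupCon-style):
    average the log-likelihood over each anchor's positive set."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = (a @ v.t()) / temperature         # (N, N)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1)
    return loss.mean()


# Usage with random tensors standing in for encoder outputs:
audio = torch.randn(64, 128)
video = torch.randn(64, 128)
mask = mine_cross_modal_positives(audio, video, k=3)
loss = multi_positive_nce(audio, video, mask)
```

In practice this kind of mining would use embeddings from partially trained (or momentum) encoders, since randomly initialized embeddings carry no semantics; the random tensors above are stand-ins only.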