Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data

Smart cities deploy various sensors such as microphones and RGB cameras to collect data to improve the safety and comfort of the citizens. As data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual representations for downstream tasks. Focusing on surveillance data, we investigate two common limitations of audio-visual contrastive learning: false negatives and the minimal sufficient information bottleneck. Irregular, yet frequently recurring events can lead to a considerable number of false-negative pairs and disrupt the model’s training. To tackle this challenge, we propose a novel method for generating contrastive pairs based on the distance between embeddings of different modalities, rather than relying solely on temporal cues. The semantically synchronized pairs can then be used to ease the minimal sufficient information bottleneck along with the new loss function for multiple positives. We experimentally validate our approach on real-world data and show how the learnt representations can be used for different downstream tasks, including audio-visual event localization, anomaly detection, and event search. Our approach reaches similar performance as state-of-the-art modality- and task-specific approaches.
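The abstract describes two mechanisms: mining contrastive pairs by embedding distance across modalities rather than by temporal co-occurrence alone, and a loss function that accommodates multiple positives per anchor. The sketch below is an illustrative reconstruction of that idea, not the authors' code: the top-k cross-modal neighbor rule, the function names, and the SupCon-style multi-positive InfoNCE loss are all assumptions based only on the abstract.

```python
import torch
import torch.nn.functional as F

def mine_cross_modal_positives(audio_emb, video_emb, k=3):
    """Hypothetical pair-generation step: besides its temporally aligned
    video segment, each audio clip also treats the k video segments whose
    embeddings lie closest in the shared space as positives.

    audio_emb, video_emb: (N, D) L2-normalized encoder outputs.
    Returns a boolean (N, N) mask of positive pairs.
    """
    sim = audio_emb @ video_emb.t()                   # cosine similarity (N, N)
    topk = sim.topk(k, dim=1).indices                 # k nearest video neighbors
    pos_mask = torch.zeros_like(sim, dtype=torch.bool)
    rows = torch.arange(sim.size(0)).unsqueeze(1)     # (N, 1) broadcasts over topk
    pos_mask[rows, topk] = True                       # embedding-based positives
    pos_mask.fill_diagonal_(True)                     # temporal pair stays positive
    return pos_mask

def multi_positive_nce(audio_emb, video_emb, pos_mask, temperature=0.07):
    """InfoNCE-style loss averaged over every positive per anchor, in the
    spirit of the supervised contrastive (SupCon) loss; audio-to-video
    direction only, for brevity."""
    logits = audio_emb @ video_emb.t() / temperature
    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1)
    return loss.mean()

# Toy usage: random embeddings standing in for the two modality encoders.
N, D = 8, 128
audio = F.normalize(torch.randn(N, D), dim=1)
video = F.normalize(torch.randn(N, D), dim=1)
mask = mine_cross_modal_positives(audio, video, k=2)
print(multi_positive_nce(audio, video, mask))
```

Treating the nearest cross-modal neighbors as extra positives is one plausible way to stop the loss from penalizing semantically synchronized clips that merely occur at different times, which is the false-negative problem the abstract identifies.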

Bibliographic Details
Main Authors: Wei-Cheng Wang, Sander De Coninck, Sam Leroux, Pieter Simoens
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-01-01
Series: Frontiers in Robotics and AI
ISSN: 2296-9144
Subjects: self-supervised learning; surveillance; audio-visual representation learning; contrastive learning; audio-visual event localization; anomaly detection
DOI: 10.3389/frobt.2024.1490718
Online Access: https://www.frontiersin.org/articles/10.3389/frobt.2024.1490718/full