EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing

The ability to identify specific sounds in noisy environments can be improved by incorporating visual information through audio-visual integration, leveraging visual cues such as lip reading and sound-producing object recognition. Recent advancements in deep learning have enabled effective audio-vis...


Bibliographic Details
Main Authors: Donghyeok Jo, Jun-Hwa Kim, Jihoon Jeon, Chee Sun Won
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10830501/
_version_ 1841542584275566592
author Donghyeok Jo
Jun-Hwa Kim
Jihoon Jeon
Chee Sun Won
author_facet Donghyeok Jo
Jun-Hwa Kim
Jihoon Jeon
Chee Sun Won
author_sort Donghyeok Jo
collection DOAJ
description The ability to identify specific sounds in noisy environments can be improved by incorporating visual information through audio-visual integration, leveraging visual cues such as lip reading and sound-producing object recognition. Recent advancements in deep learning have enabled effective audio-visual sound source separation methods. Simultaneously, the increasing adoption of wearable devices capable of processing audio-visual information has further driven the demand for On-screen Sound source Separation (OSS), particularly in dynamic, egocentric scenarios. However, OSS in these scenarios poses several technical challenges, such as adapting to rapidly changing perspectives, ensuring real-time performance on resource-constrained edge devices, and developing computationally efficient learning strategies. To address these challenges, we propose EgoSep, a method designed for Egocentric On-screen Sound Source Separation (Ego-OSS). EgoSep integrates appearance and motion features from visual data with audio features extracted using a U-Net-based encoder, enabling robust separation in dynamic environments. The method is evaluated using the signal-to-noise ratio (SNR), treating on-screen sounds as the signal and off-screen sounds as noise. For the experiments, we combine two public datasets: EPIC-KITCHENS, a large-scale egocentric video dataset, and ESC-50, an audio-only dataset. We simulate realistic scenarios by mixing EPIC-KITCHENS on-screen sounds with ESC-50 off-screen noise. Experimental results show that EgoSep effectively suppresses noise (i.e., off-screen sounds), improving the SNR of the test data from 3.05 dB at the input to 10.01 dB at the output. Additionally, real-time feasibility is validated on the NVIDIA Jetson Nano Developer Kit, achieving a real-time factor (RTF) of 0.17, demonstrating its practicality for wearable applications. The audio-mixed datasets and some results are available at https://donghyeok-jo.github.io/Ego-OSS.
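The abstract's evaluation protocol (treat the on-screen track as the signal, the off-screen track as noise, and mix them at a controlled input SNR) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the arrays here are random placeholders standing in for EPIC-KITCHENS and ESC-50 audio, and the 3.05 dB target matches the input SNR the abstract reports.

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB: on-screen sound as signal, off-screen sound as noise."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

def mix_at_snr(signal, noise, target_snr_db):
    """Scale the noise so the mixture has the requested input SNR."""
    current = snr_db(signal, noise)
    # Reducing noise amplitude by a factor g raises SNR by 20*log10(g).
    scale = 10.0 ** ((current - target_snr_db) / 20.0)
    scaled_noise = noise * scale
    return signal + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
on_screen = rng.standard_normal(16000)    # 1 s of placeholder audio at 16 kHz
off_screen = rng.standard_normal(16000)   # placeholder off-screen noise

mixture, scaled = mix_at_snr(on_screen, off_screen, target_snr_db=3.05)
print(round(snr_db(on_screen, scaled), 2))  # → 3.05
```

A separator's output SNR is computed the same way, with its residual error taking the place of `scaled`; the paper's reported improvement (3.05 dB in, 10.01 dB out) corresponds to the noise energy dropping by roughly a factor of five.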
format Article
id doaj-art-d6d7e5ee876148ebb7a4944114e7d21c
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-d6d7e5ee876148ebb7a4944114e7d21c (indexed 2025-01-14T00:02:23Z, eng)
IEEE, IEEE Access, ISSN 2169-3536, published 2025-01-01, vol. 13, pp. 6387–6396, DOI 10.1109/ACCESS.2025.3526757, IEEE document 10830501
EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing
Donghyeok Jo — Department of Electronics and Electrical Engineering, Dongguk University, Seoul, South Korea
Jun-Hwa Kim (https://orcid.org/0000-0002-2548-8853) — Department of Artificial Intelligence, Konyang University, Daejeon, South Korea
Jihoon Jeon — Department of Electronics and Electrical Engineering, Dongguk University, Seoul, South Korea
Chee Sun Won (https://orcid.org/0000-0002-3400-0792) — Department of Electronics and Electrical Engineering, Dongguk University, Seoul, South Korea
https://ieeexplore.ieee.org/document/10830501/
Topics: Audio-visual deep learning; on-screen sound separation; edge computing
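The real-time factor cited above (RTF 0.17 on a Jetson Nano) is the ratio of processing time to audio duration, so RTF < 1 means the model keeps up with the incoming stream. A minimal measurement harness, with a trivial identity function standing in for the actual EgoSep network (which is not part of this record):

```python
import time

def real_time_factor(process_fn, audio, sample_rate):
    """RTF = wall-clock processing time / audio duration (RTF < 1 is real time)."""
    duration_s = len(audio) / sample_rate
    start = time.perf_counter()
    process_fn(audio)
    elapsed_s = time.perf_counter() - start
    return elapsed_s / duration_s

samples = [0.0] * 16000                    # 1 s of placeholder audio at 16 kHz
rtf = real_time_factor(lambda x: x, samples, 16000)
print(rtf < 1.0)  # the identity pass is far faster than real time
```

In practice the measured function would be the full inference step (feature extraction plus separation), averaged over many clips to smooth out timing jitter.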
spellingShingle Donghyeok Jo
Jun-Hwa Kim
Jihoon Jeon
Chee Sun Won
EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing
IEEE Access
Audio-visual deep learning
on-screen sound separation
edge computing
title EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing
title_full EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing
title_fullStr EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing
title_full_unstemmed EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing
title_short EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing
title_sort egosep egocentric on screen sound source separation for real time edge computing
topic Audio-visual deep learning
on-screen sound separation
edge computing
url https://ieeexplore.ieee.org/document/10830501/
work_keys_str_mv AT donghyeokjo egosepegocentriconscreensoundsourceseparationforrealtimeedgecomputing
AT junhwakim egosepegocentriconscreensoundsourceseparationforrealtimeedgecomputing
AT jihoonjeon egosepegocentriconscreensoundsourceseparationforrealtimeedgecomputing
AT cheesunwon egosepegocentriconscreensoundsourceseparationforrealtimeedgecomputing