EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing
The ability to identify specific sounds in noisy environments can be improved by incorporating visual information through audio-visual integration, leveraging visual cues such as lip reading and sound-producing object recognition. Recent advancements in deep learning have enabled effective audio-visual sound source separation methods. Simultaneously, the increasing adoption of wearable devices capable of processing audio-visual information has further driven the demand for On-screen Sound source Separation (OSS), particularly in dynamic, egocentric scenarios. However, OSS in these scenarios still faces several technical challenges, such as adapting to rapidly changing perspectives, ensuring real-time performance on resource-constrained edge devices, and developing computationally efficient learning strategies. To address these challenges, we propose EgoSep, a method designed for Egocentric On-screen Sound Source Separation (Ego-OSS). EgoSep integrates appearance and motion features from visual data with audio features extracted using a U-Net-based encoder, enabling robust separation in dynamic environments. The method is evaluated using the signal-to-noise ratio (SNR), treating on-screen sounds as signals and off-screen sounds as noise. For the experiments, we combine two public datasets: EPIC-KITCHENS, a large-scale egocentric video dataset, and ESC-50, an audio-only dataset. We simulate realistic scenarios by mixing EPIC-KITCHENS on-screen sounds with ESC-50 off-screen noise. Experimental results show that EgoSep effectively suppresses noise (i.e., off-screen sounds), improving the SNR of the test data from 3.05 dB at the input to 10.01 dB at the output. Additionally, real-time feasibility is validated on the NVIDIA Jetson Nano Developer Kit, achieving a real-time factor (RTF) of 0.17, demonstrating its practicality for wearable applications. The audio-mixed datasets and some results are available at https://donghyeok-jo.github.io/Ego-OSS.
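The two evaluation metrics named in the abstract, SNR (with on-screen sound as signal and off-screen sound as noise) and the real-time factor (RTF), can be sketched as follows. This is a minimal illustration of the standard definitions only; the function names are our own and not taken from the paper.

```python
import numpy as np

def snr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """SNR in dB of an estimate against the clean (on-screen) reference.

    The residual (estimate - reference) plays the role of the remaining
    off-screen noise; higher output SNR means better suppression.
    """
    residual = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; RTF < 1 is faster than real time."""
    return processing_seconds / audio_seconds
```

For example, an RTF of 0.17 means processing 10 s of audio takes 1.7 s, so a device with that RTF keeps up with a live audio stream with headroom to spare.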
Main Authors: | Donghyeok Jo, Jun-Hwa Kim, Jihoon Jeon, Chee Sun Won |
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Audio-visual deep learning; on-screen sound separation; edge computing |
Online Access: | https://ieeexplore.ieee.org/document/10830501/ |
author | Donghyeok Jo; Jun-Hwa Kim; Jihoon Jeon; Chee Sun Won |
author_sort | Donghyeok Jo |
collection | DOAJ |
description | The ability to identify specific sounds in noisy environments can be improved by incorporating visual information through audio-visual integration, leveraging visual cues such as lip reading and sound-producing object recognition. Recent advancements in deep learning have enabled effective audio-visual sound source separation methods. Simultaneously, the increasing adoption of wearable devices capable of processing audio-visual information has further driven the demand for On-screen Sound source Separation (OSS), particularly in dynamic, egocentric scenarios. However, OSS in these scenarios still faces several technical challenges, such as adapting to rapidly changing perspectives, ensuring real-time performance on resource-constrained edge devices, and developing computationally efficient learning strategies. To address these challenges, we propose EgoSep, a method designed for Egocentric On-screen Sound Source Separation (Ego-OSS). EgoSep integrates appearance and motion features from visual data with audio features extracted using a U-Net-based encoder, enabling robust separation in dynamic environments. The method is evaluated using the signal-to-noise ratio (SNR), treating on-screen sounds as signals and off-screen sounds as noise. For the experiments, we combine two public datasets: EPIC-KITCHENS, a large-scale egocentric video dataset, and ESC-50, an audio-only dataset. We simulate realistic scenarios by mixing EPIC-KITCHENS on-screen sounds with ESC-50 off-screen noise. Experimental results show that EgoSep effectively suppresses noise (i.e., off-screen sounds), improving the SNR of the test data from 3.05 dB at the input to 10.01 dB at the output. Additionally, real-time feasibility is validated on the NVIDIA Jetson Nano Developer Kit, achieving a real-time factor (RTF) of 0.17, demonstrating its practicality for wearable applications. The audio-mixed datasets and some results are available at https://donghyeok-jo.github.io/Ego-OSS. |
format | Article |
id | doaj-art-d6d7e5ee876148ebb7a4944114e7d21c |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-d6d7e5ee876148ebb7a4944114e7d21c (indexed 2025-01-14). English; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 6387-6396; doi:10.1109/ACCESS.2025.3526757; IEEE Xplore 10830501. EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing. Donghyeok Jo; Jun-Hwa Kim (ORCID 0000-0002-2548-8853); Jihoon Jeon; Chee Sun Won (ORCID 0000-0002-3400-0792). Affiliations: Department of Electronics and Electrical Engineering, Dongguk University, Seoul, South Korea (Jo, Jeon, Won); Department of Artificial Intelligence, Konyang University, Daejeon, South Korea (Kim). |
title | EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing |
topic | Audio-visual deep learning; on-screen sound separation; edge computing |
url | https://ieeexplore.ieee.org/document/10830501/ |