EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing
The ability to identify specific sounds in noisy environments can be improved by incorporating visual information through audio-visual integration, leveraging visual cues such as lip reading and sound-producing object recognition. Recent advancements in deep learning have enabled effective audio-visual sound source separation methods. Simultaneously, the increasing adoption of wearable devices capable of processing audio-visual information has further driven the demand for On-screen Sound source Separation (OSS), particularly in dynamic, egocentric scenarios. However, OSS in these scenarios still faces several technical challenges, such as adapting to rapidly changing perspectives, ensuring real-time performance on resource-constrained edge devices, and developing computationally efficient learning strategies. To address these challenges, we propose EgoSep, a method designed for Egocentric On-screen Sound Source Separation (Ego-OSS). EgoSep integrates appearance and motion features from visual data with audio features extracted using a U-Net-based encoder, enabling robust separation in dynamic environments. The method is evaluated using the signal-to-noise ratio (SNR), treating on-screen sounds as signals and off-screen sounds as noise. For the experiments, we combine two public datasets: EPIC-KITCHENS, a large-scale egocentric video dataset, and ESC-50, an audio-only dataset. We simulate realistic scenarios by mixing EPIC-KITCHENS on-screen sounds with ESC-50 off-screen noise. Experimental results show that EgoSep effectively suppresses noise (i.e., off-screen sounds), improving the SNR of the test data from 3.05 dB at the input to 10.01 dB at the output. Additionally, real-time feasibility is validated on the NVIDIA Jetson Nano Developer Kit, achieving a real-time factor (RTF) of 0.17, demonstrating its practicality for wearable applications. The audio-mixed datasets and some results are available at https://donghyeok-jo.github.io/Ego-OSS.
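The two evaluation metrics named in the abstract, SNR (with on-screen sound as signal and off-screen sound as noise) and the real-time factor (RTF), can be sketched as follows. This is a minimal illustration of the standard definitions only; the function names are our own and not taken from the paper.

```python
import numpy as np

def snr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """SNR in dB of an estimate against the clean (on-screen) reference.

    The residual (estimate - reference) plays the role of the remaining
    off-screen noise; higher output SNR means better suppression.
    """
    residual = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; RTF < 1 is faster than real time."""
    return processing_seconds / audio_seconds
```

For example, an RTF of 0.17 means processing 10 s of audio takes 1.7 s, so a device with that RTF keeps up with a live audio stream with headroom to spare.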
Main Authors: | Donghyeok Jo, Jun-Hwa Kim, Jihoon Jeon, Chee Sun Won |
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Audio-visual deep learning; on-screen sound separation; edge computing |
Online Access: | https://ieeexplore.ieee.org/document/10830501/ |
author | Donghyeok Jo; Jun-Hwa Kim; Jihoon Jeon; Chee Sun Won |
author_sort | Donghyeok Jo |
collection | DOAJ |
description | The ability to identify specific sounds in noisy environments can be improved by incorporating visual information through audio-visual integration, leveraging visual cues such as lip reading and sound-producing object recognition. Recent advancements in deep learning have enabled effective audio-visual sound source separation methods. Simultaneously, the increasing adoption of wearable devices capable of processing audio-visual information has further driven the demand for On-screen Sound source Separation (OSS), particularly in dynamic, egocentric scenarios. However, OSS in these scenarios still faces several technical challenges, such as adapting to rapidly changing perspectives, ensuring real-time performance on resource-constrained edge devices, and developing computationally efficient learning strategies. To address these challenges, we propose EgoSep, a method designed for Egocentric On-screen Sound Source Separation (Ego-OSS). EgoSep integrates appearance and motion features from visual data with audio features extracted using a U-Net-based encoder, enabling robust separation in dynamic environments. The method is evaluated using the signal-to-noise ratio (SNR), treating on-screen sounds as signals and off-screen sounds as noise. For the experiments, we combine two public datasets: EPIC-KITCHENS, a large-scale egocentric video dataset, and ESC-50, an audio-only dataset. We simulate realistic scenarios by mixing EPIC-KITCHENS on-screen sounds with ESC-50 off-screen noise. Experimental results show that EgoSep effectively suppresses noise (i.e., off-screen sounds), improving the SNR of the test data from 3.05 dB at the input to 10.01 dB at the output. Additionally, real-time feasibility is validated on the NVIDIA Jetson Nano Developer Kit, achieving a real-time factor (RTF) of 0.17, demonstrating its practicality for wearable applications. The audio-mixed datasets and some results are available at https://donghyeok-jo.github.io/Ego-OSS. |
format | Article |
id | doaj-art-d6d7e5ee876148ebb7a4944114e7d21c |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-d6d7e5ee876148ebb7a4944114e7d21c (indexed 2025-01-14). English; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 6387-6396; doi:10.1109/ACCESS.2025.3526757; IEEE Xplore 10830501. EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing. Donghyeok Jo; Jun-Hwa Kim (ORCID 0000-0002-2548-8853); Jihoon Jeon; Chee Sun Won (ORCID 0000-0002-3400-0792). Affiliations: Department of Electronics and Electrical Engineering, Dongguk University, Seoul, South Korea (Jo, Jeon, Won); Department of Artificial Intelligence, Konyang University, Daejeon, South Korea (Kim). |
title | EgoSep: Egocentric On-Screen Sound Source Separation for Real-Time Edge Computing |
topic | Audio-visual deep learning; on-screen sound separation; edge computing |
url | https://ieeexplore.ieee.org/document/10830501/ |