A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation

Bibliographic Details
Main Authors: Karn N. Watcharasupat, Chih-Wei Wu, Yiwei Ding, Iroro Orife, Aaron J. Hipple, Phillip A. Williams, Scott Kramer, Alexander Lerch, William Wolcott
Format: Article
Language: English
Published: IEEE, 2024-01-01
Series: IEEE Open Journal of Signal Processing
Subjects: Cinematic audio; deep learning; psychoacoustical frequency scale; source separation
Online Access: https://ieeexplore.ieee.org/document/10342812/
Collection: DOAJ
Description: Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
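The abstract's "overcomplete partition of the frequency axis" can be illustrated with a short sketch: STFT frequency bins are grouped into bands whose edges follow a psychoacoustic scale, with neighboring bands widened so that they overlap. This is only a minimal illustration under assumed choices (an HTK-style mel scale, a 50% widening factor, and the function names below); it is not the paper's exact band definition.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale, a common psychoacoustically motivated mapping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def overlapping_bands(n_bins, sr, n_bands, overlap=0.5):
    """Group STFT bin indices into overlapping mel-spaced bands.

    Returns a list of (start, end) half-open bin ranges. Band edges are
    equally spaced on the mel scale; each band is then widened on both
    sides by `overlap` of its width, so bands share bins with their
    neighbors (an overcomplete partition with redundant coverage).
    """
    nyquist = sr / 2.0
    edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(nyquist), n_bands + 1)
    edges_hz = mel_to_hz(edges_mel)
    edges_bin = np.round(edges_hz / nyquist * (n_bins - 1)).astype(int)
    bands = []
    for i in range(n_bands):
        lo, hi = edges_bin[i], edges_bin[i + 1]
        pad = int(overlap * (hi - lo) / 2)  # widen each side for redundancy
        bands.append((max(0, lo - pad), min(n_bins, hi + pad + 1)))
    return bands
```

Because every widened band still contains its original range, the union of the bands covers all bins, while the shared bins between neighbors provide the redundancy the abstract attributes to more reliable feature extraction.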
Record ID: doaj-art-f7a39c3dfb7f46e088e02e0f0ca110c6
ISSN: 2644-1322
Publication details: IEEE Open Journal of Signal Processing, vol. 5, pp. 73-81, 2024. DOI: 10.1109/OJSP.2023.3339428. IEEE Xplore article number: 10342812.
Authors and affiliations:
Karn N. Watcharasupat (https://orcid.org/0000-0002-3878-5048), Netflix, Inc., Los Gatos, CA, USA
Chih-Wei Wu (https://orcid.org/0000-0002-9019-6515), Netflix, Inc., Los Gatos, CA, USA
Yiwei Ding (https://orcid.org/0000-0002-8156-3715), Music Informatics Group, Georgia Institute of Technology, Atlanta, GA, USA
Iroro Orife (https://orcid.org/0000-0002-3030-2312), Netflix, Inc., Los Gatos, CA, USA
Aaron J. Hipple (https://orcid.org/0009-0003-5957-480X), Netflix, Inc., Los Gatos, CA, USA
Phillip A. Williams (https://orcid.org/0009-0003-4521-3827), Netflix, Inc., Los Gatos, CA, USA
Scott Kramer (https://orcid.org/0009-0007-9365-0588), Netflix, Inc., Los Gatos, CA, USA
Alexander Lerch (https://orcid.org/0000-0001-6319-578X), Music Informatics Group, Georgia Institute of Technology, Atlanta, GA, USA
William Wolcott (https://orcid.org/0009-0001-6772-8202), Netflix, Inc., Los Gatos, CA, USA
Keywords: Cinematic audio; deep learning; psychoacoustical frequency scale; source separation