A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation
Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions, which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset, with performance above the ideal ratio mask for the dialogue stem.
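The loss function described above combines an SNR-style objective with the sparsity-promoting 1-norm. A minimal numpy sketch of one plausible reading — replacing the squared-error terms of the usual signal-to-noise ratio with 1-norms — is shown below; the function name and exact formulation are illustrative assumptions, not the paper's actual definition:

```python
import numpy as np

def l1_snr_loss(estimate, target, eps=1e-8):
    """SNR-motivated loss built on the 1-norm (hypothetical sketch).

    The usual SNR uses squared errors; here both the signal term and the
    residual term use 1-norms, so minimizing the loss both rewards a high
    signal-to-noise ratio and promotes sparse residuals. The paper's exact
    loss may differ; this only illustrates the stated motivation.
    """
    num = np.sum(np.abs(target)) + eps            # 1-norm of the target signal
    den = np.sum(np.abs(target - estimate)) + eps  # 1-norm of the residual
    return -10.0 * np.log10(num / den)             # negated, in dB, for minimization

# Sanity check: a perfect estimate scores far better than a scaled one,
# and an all-zero estimate scores exactly 0 dB (residual equals target).
t = np.sin(np.linspace(0, 2 * np.pi, 1000))
assert l1_snr_loss(t, t) < l1_snr_loss(0.5 * t, t) < 0.0
```

Negating the log-ratio turns a quantity we want to maximize (SNR) into a quantity a gradient-based optimizer can minimize, mirroring the common negative-SNR training objective in source separation.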
Saved in:
Main Authors: | Karn N. Watcharasupat; Chih-Wei Wu; Yiwei Ding; Iroro Orife; Aaron J. Hipple; Phillip A. Williams; Scott Kramer; Alexander Lerch; William Wolcott |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2024-01-01 |
Series: | IEEE Open Journal of Signal Processing |
Subjects: | Cinematic audio; deep learning; psychoacoustical frequency scale; source separation |
Online Access: | https://ieeexplore.ieee.org/document/10342812/ |
_version_ | 1841533455021637632 |
---|---|
author | Karn N. Watcharasupat; Chih-Wei Wu; Yiwei Ding; Iroro Orife; Aaron J. Hipple; Phillip A. Williams; Scott Kramer; Alexander Lerch; William Wolcott |
author_facet | Karn N. Watcharasupat; Chih-Wei Wu; Yiwei Ding; Iroro Orife; Aaron J. Hipple; Phillip A. Williams; Scott Kramer; Alexander Lerch; William Wolcott |
author_sort | Karn N. Watcharasupat |
collection | DOAJ |
description | Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem. |
format | Article |
id | doaj-art-f7a39c3dfb7f46e088e02e0f0ca110c6 |
institution | Kabale University |
issn | 2644-1322 |
language | English |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Open Journal of Signal Processing |
spelling | doaj-art-f7a39c3dfb7f46e088e02e0f0ca110c6; 2025-01-16T00:02:28Z; eng; IEEE; IEEE Open Journal of Signal Processing; ISSN 2644-1322; 2024-01-01; vol. 5, pp. 73–81; doi:10.1109/OJSP.2023.3339428; document 10342812; A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation; Karn N. Watcharasupat (https://orcid.org/0000-0002-3878-5048; Netflix, Inc., Los Gatos, CA, USA); Chih-Wei Wu (https://orcid.org/0000-0002-9019-6515; Netflix, Inc., Los Gatos, CA, USA); Yiwei Ding (https://orcid.org/0000-0002-8156-3715; Music Informatics Group, Georgia Institute of Technology, Atlanta, GA, USA); Iroro Orife (https://orcid.org/0000-0002-3030-2312; Netflix, Inc., Los Gatos, CA, USA); Aaron J. Hipple (https://orcid.org/0009-0003-5957-480X; Netflix, Inc., Los Gatos, CA, USA); Phillip A. Williams (https://orcid.org/0009-0003-4521-3827; Netflix, Inc., Los Gatos, CA, USA); Scott Kramer (https://orcid.org/0009-0007-9365-0588; Netflix, Inc., Los Gatos, CA, USA); Alexander Lerch (https://orcid.org/0000-0001-6319-578X; Music Informatics Group, Georgia Institute of Technology, Atlanta, GA, USA); William Wolcott (https://orcid.org/0009-0001-6772-8202; Netflix, Inc., Los Gatos, CA, USA); abstract as in the description field; https://ieeexplore.ieee.org/document/10342812/; Cinematic audio; deep learning; psychoacoustical frequency scale; source separation |
spellingShingle | Karn N. Watcharasupat; Chih-Wei Wu; Yiwei Ding; Iroro Orife; Aaron J. Hipple; Phillip A. Williams; Scott Kramer; Alexander Lerch; William Wolcott; A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation; IEEE Open Journal of Signal Processing; Cinematic audio; deep learning; psychoacoustical frequency scale; source separation |
title | A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation |
title_full | A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation |
title_fullStr | A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation |
title_full_unstemmed | A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation |
title_short | A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation |
title_sort | generalized bandsplit neural network for cinematic audio source separation |
topic | Cinematic audio; deep learning; psychoacoustical frequency scale; source separation |
url | https://ieeexplore.ieee.org/document/10342812/ |
work_keys_str_mv | AT karnnwatcharasupat ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT chihweiwu ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT yiweiding ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT iroroorife ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT aaronjhipple ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT phillipawilliams ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT scottkramer ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT alexanderlerch ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT williamwolcott ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT karnnwatcharasupat generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT chihweiwu generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT yiweiding generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT iroroorife generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT aaronjhipple generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT phillipawilliams generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT scottkramer generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT alexanderlerch generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT williamwolcott generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation |