A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation
Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions, which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset, with performance above the ideal ratio mask for the dialogue stem.
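The loss function described above combines an SNR-style objective with the sparsity-promoting 1-norm. A minimal numpy sketch of one plausible reading — replacing the squared-error terms of the usual signal-to-noise ratio with 1-norms — is shown below; the function name and exact formulation are illustrative assumptions, not the paper's actual definition:

```python
import numpy as np

def l1_snr_loss(estimate, target, eps=1e-8):
    """SNR-motivated loss built on the 1-norm (hypothetical sketch).

    The usual SNR uses squared errors; here both the signal term and the
    residual term use 1-norms, so minimizing the loss both rewards a high
    signal-to-noise ratio and promotes sparse residuals. The paper's exact
    loss may differ; this only illustrates the stated motivation.
    """
    num = np.sum(np.abs(target)) + eps            # 1-norm of the target signal
    den = np.sum(np.abs(target - estimate)) + eps  # 1-norm of the residual
    return -10.0 * np.log10(num / den)             # negated, in dB, for minimization

# Sanity check: a perfect estimate scores far better than a scaled one,
# and an all-zero estimate scores exactly 0 dB (residual equals target).
t = np.sin(np.linspace(0, 2 * np.pi, 1000))
assert l1_snr_loss(t, t) < l1_snr_loss(0.5 * t, t) < 0.0
```

Negating the log-ratio turns a quantity we want to maximize (SNR) into a quantity a gradient-based optimizer can minimize, mirroring the common negative-SNR training objective in source separation.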
Saved in:
Main Authors: | Karn N. Watcharasupat; Chih-Wei Wu; Yiwei Ding; Iroro Orife; Aaron J. Hipple; Phillip A. Williams; Scott Kramer; Alexander Lerch; William Wolcott |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2024-01-01 |
Series: | IEEE Open Journal of Signal Processing |
Subjects: | Cinematic audio; deep learning; psychoacoustical frequency scale; source separation |
Online Access: | https://ieeexplore.ieee.org/document/10342812/ |
_version_ | 1841533455021637632 |
---|---|
author | Karn N. Watcharasupat; Chih-Wei Wu; Yiwei Ding; Iroro Orife; Aaron J. Hipple; Phillip A. Williams; Scott Kramer; Alexander Lerch; William Wolcott |
author_facet | Karn N. Watcharasupat; Chih-Wei Wu; Yiwei Ding; Iroro Orife; Aaron J. Hipple; Phillip A. Williams; Scott Kramer; Alexander Lerch; William Wolcott |
author_sort | Karn N. Watcharasupat |
collection | DOAJ |
description | Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem. |
format | Article |
id | doaj-art-f7a39c3dfb7f46e088e02e0f0ca110c6 |
institution | Kabale University |
issn | 2644-1322 |
language | English |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Open Journal of Signal Processing |
spelling | doaj-art-f7a39c3dfb7f46e088e02e0f0ca110c6; 2025-01-16T00:02:28Z; eng; IEEE; IEEE Open Journal of Signal Processing; ISSN 2644-1322; 2024-01-01; vol. 5, pp. 73–81; doi:10.1109/OJSP.2023.3339428; document 10342812; A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation; Karn N. Watcharasupat (https://orcid.org/0000-0002-3878-5048; Netflix, Inc., Los Gatos, CA, USA); Chih-Wei Wu (https://orcid.org/0000-0002-9019-6515; Netflix, Inc., Los Gatos, CA, USA); Yiwei Ding (https://orcid.org/0000-0002-8156-3715; Music Informatics Group, Georgia Institute of Technology, Atlanta, GA, USA); Iroro Orife (https://orcid.org/0000-0002-3030-2312; Netflix, Inc., Los Gatos, CA, USA); Aaron J. Hipple (https://orcid.org/0009-0003-5957-480X; Netflix, Inc., Los Gatos, CA, USA); Phillip A. Williams (https://orcid.org/0009-0003-4521-3827; Netflix, Inc., Los Gatos, CA, USA); Scott Kramer (https://orcid.org/0009-0007-9365-0588; Netflix, Inc., Los Gatos, CA, USA); Alexander Lerch (https://orcid.org/0000-0001-6319-578X; Music Informatics Group, Georgia Institute of Technology, Atlanta, GA, USA); William Wolcott (https://orcid.org/0009-0001-6772-8202; Netflix, Inc., Los Gatos, CA, USA); abstract as in the description field; https://ieeexplore.ieee.org/document/10342812/; Cinematic audio; deep learning; psychoacoustical frequency scale; source separation |
spellingShingle | Karn N. Watcharasupat; Chih-Wei Wu; Yiwei Ding; Iroro Orife; Aaron J. Hipple; Phillip A. Williams; Scott Kramer; Alexander Lerch; William Wolcott; A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation; IEEE Open Journal of Signal Processing; Cinematic audio; deep learning; psychoacoustical frequency scale; source separation |
title | A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation |
title_full | A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation |
title_fullStr | A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation |
title_full_unstemmed | A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation |
title_short | A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation |
title_sort | generalized bandsplit neural network for cinematic audio source separation |
topic | Cinematic audio; deep learning; psychoacoustical frequency scale; source separation |
url | https://ieeexplore.ieee.org/document/10342812/ |
work_keys_str_mv | AT karnnwatcharasupat ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT chihweiwu ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT yiweiding ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT iroroorife ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT aaronjhipple ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT phillipawilliams ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT scottkramer ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT alexanderlerch ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT williamwolcott ageneralizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT karnnwatcharasupat generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT chihweiwu generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT yiweiding generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT iroroorife generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT aaronjhipple generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT phillipawilliams generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT scottkramer generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT alexanderlerch generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation AT williamwolcott generalizedbandsplitneuralnetworkforcinematicaudiosourceseparation |