Multi-Level Feature Dynamic Fusion Neural Radiance Fields for Audio-Driven Talking Head Generation

Audio-driven cross-modal talking head generation has advanced significantly in recent years; its aim is to generate a talking head video that corresponds to a given audio sequence. Among existing approaches, NeRF-based methods can generate videos of a specific person with more natural motion than one-shot methods. However, previous approaches fail to distinguish the importance of different facial regions, resulting in the loss of features from information-rich regions. To alleviate this problem and improve video quality, we propose MLDF-NeRF, an end-to-end method for talking head generation that achieves better vector representations through multi-level feature dynamic fusion. Specifically, we design two modules in MLDF-NeRF to enhance the cross-modal mapping between audio and different facial regions. First, we develop a multi-level tri-plane hash representation that uses three sets of tri-plane hash networks with varying resolution limits to capture the dynamic information of the face more accurately. Second, drawing on the idea of multi-head attention, we design an efficient audio-visual fusion module that explicitly fuses audio features with image features from different planes, improving the mapping between audio features and spatial information. This design also minimizes interference from facial regions unrelated to audio, further improving the overall quality of the representation. Quantitative and qualitative results show that our method generates talking heads with natural motion and realistic details, outperforming previous methods in image quality, lip synchronization, and other aspects.
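
To make the multi-level tri-plane idea concrete, the sketch below shows one plausible PyTorch realization: three sets of learnable feature planes at increasing resolutions, each sampled at a 3D point's projections onto the xy, yz, and xz planes. Dense grids with grid_sample stand in for the hash-encoded planes used in the paper; the resolutions, channel count, and class names are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a multi-level tri-plane representation, assuming the
# paper's setup of three tri-plane networks at different resolutions.
# Dense grids + grid_sample stand in for hash-encoded planes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlane(nn.Module):
    """Three learnable 2D feature planes (xy, yz, xz) at one resolution."""
    def __init__(self, resolution: int, channels: int):
        super().__init__()
        self.planes = nn.Parameter(
            torch.randn(3, channels, resolution, resolution) * 0.01)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) points in [-1, 1]^3, projected onto the 3 planes
        coords = torch.stack([xyz[:, [0, 1]],   # xy plane
                              xyz[:, [1, 2]],   # yz plane
                              xyz[:, [0, 2]]])  # xz plane -> (3, N, 2)
        # grid_sample over (3, C, R, R) with grid (3, N, 1, 2) -> (3, C, N, 1)
        feats = F.grid_sample(self.planes, coords.unsqueeze(2),
                              align_corners=True)
        return feats.squeeze(-1).permute(2, 0, 1)  # (N, 3, C)

class MultiLevelTriPlane(nn.Module):
    """Concatenate tri-plane features sampled at three resolutions."""
    def __init__(self, resolutions=(64, 128, 256), channels=16):
        super().__init__()
        self.levels = nn.ModuleList(
            TriPlane(r, channels) for r in resolutions)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # (N, 3, channels * num_levels): per-plane features across levels
        return torch.cat([level(xyz) for level in self.levels], dim=-1)

Concatenating per-level features lets the coarse planes capture global head structure while the finer planes resolve detail in dynamic regions such as the mouth, which is consistent with the abstract's stated motivation for using multiple resolution limits.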
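
The abstract describes the second module as multi-head attention that explicitly fuses audio features with the image features of each plane. Below is a minimal sketch of that idea, assuming plane features act as queries over a short audio window; the query/key/value assignment, dimensions, and residual design are guesses for illustration, not the paper's exact architecture.

# Minimal sketch of attention-based audio-visual fusion: each plane's
# feature attends to an audio window, so audio-correlated regions can
# draw on the audio while unrelated regions are attenuated.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, feat_dim: int = 48, audio_dim: int = 64,
                 num_heads: int = 4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, plane_feats: torch.Tensor,
                audio_feats: torch.Tensor) -> torch.Tensor:
        # plane_feats: (N, 3, D) per-point features from the 3 planes
        # audio_feats: (N, T, audio_dim) audio window per sampled point
        audio = self.audio_proj(audio_feats)             # (N, T, D)
        fused, _ = self.attn(plane_feats, audio, audio)  # (N, 3, D)
        return self.norm(plane_feats + fused)            # residual fusion

# Hypothetical usage, chaining the two sketches:
#   points = torch.rand(1024, 3) * 2 - 1
#   audio  = torch.randn(1024, 8, 64)
#   feats  = MultiLevelTriPlane()(points)       # (1024, 3, 48)
#   fused  = AudioVisualFusion()(feats, audio)  # (1024, 3, 48)

The attention weights determine how strongly each plane's feature draws on the audio signal, which matches the abstract's goal of minimizing interference from facial areas unrelated to audio.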

Bibliographic Details
Main Authors: Wenchao Song, Qiong Liu, Yanchao Liu, Pengzhou Zhang, Juan Cao (State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China)
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: Applied Sciences, Vol. 15, Issue 1, Article 479
ISSN: 2076-3417
DOI: 10.3390/app15010479
Subjects: talking head generation; neural radiance fields; audio-visual feature fusion; cross-modal content generation
Online Access: https://www.mdpi.com/2076-3417/15/1/479