AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio–Visual Speech Recognition

To address the severe information redundancy, complex inter-modal interaction, and difficult multimodal fusion that audio–visual speech recognition systems face when handling complex multimodal information, this paper proposes an adaptive fusion transformer algorithm (A...

Bibliographic Details
Main Authors:	Na Che, Yiming Zhu, Haiyan Wang, Xianwei Zeng, Qinsheng Du
Format:	Article
Language:	English
Published:	MDPI AG 2024-12-01
Series:	Applied Sciences
Subjects:	speech recognition; multimodal integration; transformer; adaptive fusion
Online Access:	https://www.mdpi.com/2076-3417/15/1/199

_version_	1841549438847287296
author	Na Che, Yiming Zhu, Haiyan Wang, Xianwei Zeng, Qinsheng Du
author_facet	Na Che, Yiming Zhu, Haiyan Wang, Xianwei Zeng, Qinsheng Du
author_sort	Na Che
collection	DOAJ
description	To address the severe information redundancy, complex inter-modal interaction, and difficult multimodal fusion that audio–visual speech recognition systems face when handling complex multimodal information, this paper proposes an adaptive fusion transformer algorithm (AFT-SAM) based on a sparse attention mechanism. The algorithm applies sparse attention during feature encoding to reduce excessive attention to unimportant regions, and it dynamically adjusts the attention weights through adaptive fusion to capture and integrate multimodal information more effectively and to limit the impact of redundant information on model performance. Experiments on the audio–visual speech recognition dataset LRS2, with comparisons against other algorithms, show that the proposed algorithm achieves significantly lower word error rates (WERs) in the audio-only, visual-only, and audio–visual bimodal cases.
format	Article
id	doaj-art-47e30af03a3040d6b03becce996eb0a4
institution	Kabale University
issn	2076-3417
language	English
publishDate	2024-12-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj-art-47e30af03a3040d6b03becce996eb0a4; 2025-01-10T13:14:46Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2024-12-01; vol. 15, no. 1, art. 199; doi:10.3390/app15010199; AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio–Visual Speech Recognition; Na Che, Yiming Zhu, Haiyan Wang, Xianwei Zeng, Qinsheng Du (all: College of Computer Science and Technology, Changchun University, Changchun 130022, China); https://www.mdpi.com/2076-3417/15/1/199; speech recognition; multimodal integration; transformer; adaptive fusion
title	AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio–Visual Speech Recognition
topic	speech recognition; multimodal integration; transformer; adaptive fusion
url	https://www.mdpi.com/2076-3417/15/1/199

Similar Items