AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio–Visual Speech Recognition

Bibliographic Details
Main Authors: Na Che, Yiming Zhu, Haiyan Wang, Xianwei Zeng, Qinsheng Du
Format: Article
Language: English
Published: MDPI AG 2024-12-01
Series: Applied Sciences
Subjects: speech recognition; multimodal integration; transformer; adaptive fusion
Online Access: https://www.mdpi.com/2076-3417/15/1/199
author Na Che
Yiming Zhu
Haiyan Wang
Xianwei Zeng
Qinsheng Du
collection DOAJ
description To address the severe information redundancy, complex inter-modal interaction, and difficult multimodal fusion that audio–visual speech recognition systems face when processing complex multimodal information, this paper proposes AFT-SAM, an adaptive fusion transformer algorithm based on a sparse attention mechanism. The algorithm applies sparse attention during feature encoding to reduce excessive attention to unimportant regions, and it dynamically adjusts the attention weights through adaptive fusion to capture and integrate multimodal information more effectively and to limit the impact of redundant information on model performance. Experiments on the audio–visual speech recognition dataset LRS2 compare the proposed algorithm with other algorithms, and the results show that it achieves significantly lower word error rates (WERs) in the audio-only, visual-only, and audio–visual bimodal cases.
format Article
id doaj-art-47e30af03a3040d6b03becce996eb0a4
institution Kabale University
issn 2076-3417
language English
publishDate 2024-12-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj-art-47e30af03a3040d6b03becce996eb0a4; 2025-01-10T13:14:46Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2024-12-01; vol. 15, no. 1, article 199; doi:10.3390/app15010199; AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio–Visual Speech Recognition; Na Che, Yiming Zhu, Haiyan Wang, Xianwei Zeng, Qinsheng Du (all: College of Computer Science and Technology, Changchun University, Changchun 130022, China); https://www.mdpi.com/2076-3417/15/1/199; speech recognition; multimodal integration; transformer; adaptive fusion
title AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio–Visual Speech Recognition
topic speech recognition
multimodal integration
transformer
adaptive fusion
url https://www.mdpi.com/2076-3417/15/1/199