AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio–Visual Speech Recognition
Aiming at the problems of serious information redundancy, complex inter-modal information interaction, and difficult multimodal fusion faced by the audio–visual speech recognition system when dealing with complex multimodal information, this paper proposes an adaptive fusion transformer algorithm (A...
Main Authors: | Na Che, Yiming Zhu, Haiyan Wang, Xianwei Zeng, Qinsheng Du |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2024-12-01 |
Series: | Applied Sciences |
Subjects: | speech recognition, multimodal integration, transformer, adaptive fusion |
Online Access: | https://www.mdpi.com/2076-3417/15/1/199 |
_version_ | 1841549438847287296 |
---|---|
author | Na Che; Yiming Zhu; Haiyan Wang; Xianwei Zeng; Qinsheng Du |
author_sort | Na Che |
collection | DOAJ |
description | Aiming at the problems of serious information redundancy, complex inter-modal information interaction, and difficult multimodal fusion faced by the audio–visual speech recognition system when dealing with complex multimodal information, this paper proposes an adaptive fusion transformer algorithm (AFT-SAM) based on a sparse attention mechanism. The algorithm adopts the sparse attention mechanism in the feature-encoding process to reduce excessive attention to non-important regions and dynamically adjusts the attention weights through adaptive fusion to capture and integrate the multimodal information more effectively and reduce the impact of redundant information on the model performance. Experiments are conducted on the audio–visual speech recognition dataset LRS2 and compared with other algorithms, and the experimental results show that the proposed algorithm in this paper has significantly lower WERs in the audio-only, visual-only, and audio–visual bimodal cases. |
format | Article |
id | doaj-art-47e30af03a3040d6b03becce996eb0a4 |
institution | Kabale University |
issn | 2076-3417 |
language | English |
publishDate | 2024-12-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj-art-47e30af03a3040d6b03becce996eb0a4; 2025-01-10T13:14:46Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2024-12-01; vol. 15, no. 1, art. 199; DOI 10.3390/app15010199; AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio–Visual Speech Recognition; Na Che, Yiming Zhu, Haiyan Wang, Xianwei Zeng, Qinsheng Du (all: College of Computer Science and Technology, Changchun University, Changchun 130022, China); Aiming at the problems of serious information redundancy, complex inter-modal information interaction, and difficult multimodal fusion faced by the audio–visual speech recognition system when dealing with complex multimodal information, this paper proposes an adaptive fusion transformer algorithm (AFT-SAM) based on a sparse attention mechanism. The algorithm adopts the sparse attention mechanism in the feature-encoding process to reduce excessive attention to non-important regions and dynamically adjusts the attention weights through adaptive fusion to capture and integrate the multimodal information more effectively and reduce the impact of redundant information on the model performance. Experiments are conducted on the audio–visual speech recognition dataset LRS2 and compared with other algorithms, and the experimental results show that the proposed algorithm in this paper has significantly lower WERs in the audio-only, visual-only, and audio–visual bimodal cases.; https://www.mdpi.com/2076-3417/15/1/199; speech recognition; multimodal integration; transformer; adaptive fusion |
title | AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio–Visual Speech Recognition |
topic | speech recognition; multimodal integration; transformer; adaptive fusion |
url | https://www.mdpi.com/2076-3417/15/1/199 |
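
The description above names three ideas: sparse attention during feature encoding, adaptive fusion of the audio and visual streams, and evaluation by word error rate (WER). The record does not give the paper's formulas, so the following is only a minimal sketch under stated assumptions: top-k masking stands in for the sparse attention rule, a single sigmoid gate stands in for adaptive fusion, and the names `topk_sparse_attention`, `adaptive_fusion`, `gate_logit`, and `word_error_rate` are hypothetical and not taken from AFT-SAM.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k keys with the highest scaled dot-product score.

    Assumption: top-k masking is used here purely as an illustrative
    sparsification rule; the paper may sparsify differently.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) for d in range(dim)]


def adaptive_fusion(audio, visual, gate_logit):
    """Blend two same-sized feature vectors with a sigmoid gate in (0, 1).

    Assumption: a trained model would predict the gate from its inputs;
    here it is a hypothetical scalar supplied by the caller.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * a + (1.0 - gate) * v for a, v in zip(audio, visual)]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                            # deletion
                            curr[j - 1] + 1,                        # insertion
                            prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = curr
    return prev[-1] / len(ref)
```

With `k` equal to the number of keys, `topk_sparse_attention` reduces to ordinary dense attention, which is one way to compare the sparse and dense variants. A `gate_logit` of 0 weights both modalities equally, while large positive or negative values approach the audio-only and visual-only cases mentioned in the description.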