Multimodal fusion-powered English speaking robot

Introduction: Speech recognition and multimodal learning are two critical areas in machine learning. Current multimodal speech recognition systems often encounter challenges such as high computational demands and model complexity. Methods: To overcome these issues, we propose a novel framework, EnglishAL-...

Bibliographic Details
Main Author: Ruiying Pan
Format: Article
Language: English
Published: Frontiers Media S.A. 2024-11-01
Series: Frontiers in Neurorobotics
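Subjects: ALBEF; Neural Machine Translation (NMT); cross-attention mechanism; multimodal robot; speech recognition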
Online Access: https://www.frontiersin.org/articles/10.3389/fnbot.2024.1478181/full
collection DOAJ
description Introduction: Speech recognition and multimodal learning are two critical areas in machine learning. Current multimodal speech recognition systems often encounter challenges such as high computational demands and model complexity. Methods: To overcome these issues, we propose a novel framework, EnglishAL-Net, a Multimodal Fusion-powered English Speaking Robot. This framework leverages the ALBEF model, optimizing it for real-time speech and multimodal interaction, and incorporates a newly designed text and image editor to fuse visual and textual information. The robot processes dynamic spoken input through the integration of Neural Machine Translation (NMT), enhancing its ability to understand and respond to spoken language. Results and discussion: In the experimental section, we constructed a dataset containing various scenarios and oral instructions for testing. The results show that compared to traditional unimodal processing methods, our model significantly improves both language understanding accuracy and response time. This research not only enhances the performance of multimodal interaction in robots but also opens up new possibilities for applications of robotic technology in education, rescue, customer service, and other fields, holding significant theoretical and practical value.
id doaj-art-7efe7699a41747c2aa1a2e8f2c47167d
institution Kabale University
issn 1662-5218
spelling doaj-art-7efe7699a41747c2aa1a2e8f2c47167d; 2024-11-15T06:13:38Z; eng; Frontiers Media S.A.; Frontiers in Neurorobotics; ISSN 1662-5218; 2024-11-01; vol. 18; doi:10.3389/fnbot.2024.1478181; article 1478181
topic ALBEF
Neural Machine Translation (NMT)
cross-attention mechanism
multimodal robot
speech recognition