Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering

In the realm of Visual Question Answering, accurate answers often hinge on the harmonious fusion of textual and visual elements. While these complex architectures are effective, they typically come with a hefty price tag: a large number of parameters that demand significant processing power and leng...

Bibliographic Details
Main Authors:	Faheem Shehzad, Aniello Minutolo, Massimo Esposito
Format:	Article
Language:	English
Published:	IEEE 2024-01-01
Series:	IEEE Access
Subjects:	Visual question answering (VQA); transformer models; natural language processing; dual-stream architecture; multimodal question answering; attention mechanisms
Online Access:	https://ieeexplore.ieee.org/document/10811881/

_version_	1841560546949726208
author	Faheem Shehzad; Aniello Minutolo; Massimo Esposito
author_facet	Faheem Shehzad; Aniello Minutolo; Massimo Esposito
author_sort	Faheem Shehzad
collection	DOAJ
description	In the realm of Visual Question Answering, accurate answers often hinge on the harmonious fusion of textual and visual elements. While these complex architectures are effective, they typically come with a hefty price tag: a large number of parameters that demand significant processing power and lengthy training times. In contrast, traditional Dual-stream approaches prioritize accuracy above all else, neglecting the memory requirements of GPU processing and training time. This paper presents a novel Dual-stream architecture for VQA, whose parameters have been rigorously tested and evaluated not only for performance, but also for GPU memory consumption and training time. The results show that it’s possible to achieve competitive performance while significantly reducing the computational burden typically associated with complex VQA models.
format	Article
id	doaj-art-354ed089f6a74cfcab842d69274098f4
institution	Kabale University
issn	2169-3536
language	English
publishDate	2024-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-354ed089f6a74cfcab842d69274098f4; 2025-01-04T00:00:57Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2024-01-01; vol. 12, pp. 195561-195574; DOI 10.1109/ACCESS.2024.3521032; IEEE Xplore document 10811881; Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering; Faheem Shehzad (https://orcid.org/0009-0003-7204-183X), Department of Electrical Engineering and Information Technology (DIETI), University of Naples “Federico II”, Naples, Italy; Aniello Minutolo (https://orcid.org/0000-0003-4744-3506), National Research Council of Italy (CNR), Institute for High Performance Computing and Networking (ICAR), Naples, Italy; Massimo Esposito (https://orcid.org/0000-0002-7196-7994), National Research Council of Italy (CNR), Institute for High Performance Computing and Networking (ICAR), Naples, Italy; https://ieeexplore.ieee.org/document/10811881/
title	Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering
topic	Visual question answering (VQA); transformer models; natural language processing; dual-stream architecture; multimodal question answering; attention mechanisms
url	https://ieeexplore.ieee.org/document/10811881/

Similar Items