Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering

Bibliographic Details
Main Authors: Faheem Shehzad, Aniello Minutolo, Massimo Esposito
Format: Article
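Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: Visual question answering (VQA); transformer models; natural language processing; dual-stream architecture; multimodal question answering; attention mechanisms
Online Access: https://ieeexplore.ieee.org/document/10811881/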
author Faheem Shehzad
Aniello Minutolo
Massimo Esposito
collection DOAJ
description In the realm of Visual Question Answering, accurate answers often hinge on the harmonious fusion of textual and visual elements. While these complex architectures are effective, they typically come with a hefty price tag: a large number of parameters that demand significant processing power and lengthy training times. In contrast, traditional Dual-stream approaches prioritize accuracy above all else, neglecting the memory requirements of GPU processing and training time. This paper presents a novel Dual-stream architecture for VQA, whose parameters have been rigorously tested and evaluated not only for performance, but also for GPU memory consumption and training time. The results show that it’s possible to achieve competitive performance while significantly reducing the computational burden typically associated with complex VQA models.
format Article
id doaj-art-354ed089f6a74cfcab842d69274098f4
institution Kabale University
issn 2169-3536
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling Record: doaj-art-354ed089f6a74cfcab842d69274098f4, updated 2025-01-04T00:00:57Z
Language: eng
Publisher: IEEE
Journal: IEEE Access, ISSN 2169-3536
Issue: 2024-01-01, vol. 12, pp. 195561-195574
DOI: 10.1109/ACCESS.2024.3521032
IEEE document number: 10811881
Title: Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering
Authors:
Faheem Shehzad (https://orcid.org/0009-0003-7204-183X), Department of Electrical Engineering and Information Technology (DIETI), University of Naples “Federico II”, Naples, Italy
Aniello Minutolo (https://orcid.org/0000-0003-4744-3506), National Research Council of Italy (CNR), Institute for High Performance Computing and Networking (ICAR), Naples, Italy
Massimo Esposito (https://orcid.org/0000-0002-7196-7994), National Research Council of Italy (CNR), Institute for High Performance Computing and Networking (ICAR), Naples, Italy
title Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering
topic Visual question answering (VQA)
transformer models
natural language processing
dual-stream architecture
multimodal question answering
attention mechanisms
url https://ieeexplore.ieee.org/document/10811881/