Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering
In the realm of Visual Question Answering, accurate answers often hinge on the harmonious fusion of textual and visual elements. While these complex architectures are effective, they typically come with a hefty price tag: a large number of parameters that demand significant processing power and leng...
Main Authors: | Faheem Shehzad, Aniello Minutolo, Massimo Esposito |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2024-01-01 |
Series: | IEEE Access |
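Subjects: | Visual question answering (VQA); transformer models; natural language processing; dual-stream architecture; multimodal question answering; attention mechanisms |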
Online Access: | https://ieeexplore.ieee.org/document/10811881/ |
_version_ | 1841560546949726208 |
---|---|
author | Faheem Shehzad; Aniello Minutolo; Massimo Esposito |
author_facet | Faheem Shehzad; Aniello Minutolo; Massimo Esposito |
author_sort | Faheem Shehzad |
collection | DOAJ |
description | In the realm of Visual Question Answering, accurate answers often hinge on the harmonious fusion of textual and visual elements. While these complex architectures are effective, they typically come with a hefty price tag: a large number of parameters that demand significant processing power and lengthy training times. In contrast, traditional Dual-stream approaches prioritize accuracy above all else, neglecting the memory requirements of GPU processing and training time. This paper presents a novel Dual-stream architecture for VQA, whose parameters have been rigorously tested and evaluated not only for performance, but also for GPU memory consumption and training time. The results show that it’s possible to achieve competitive performance while significantly reducing the computational burden typically associated with complex VQA models. |
format | Article |
id | doaj-art-354ed089f6a74cfcab842d69274098f4 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-354ed089f6a74cfcab842d69274098f4; 2025-01-04T00:00:57Z; eng; IEEE; IEEE Access; 2169-3536; 2024-01-01; 12; 195561; 195574; 10.1109/ACCESS.2024.3521032; 10811881; Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering; Faheem Shehzad (0), https://orcid.org/0009-0003-7204-183X; Aniello Minutolo (1), https://orcid.org/0000-0003-4744-3506; Massimo Esposito (2), https://orcid.org/0000-0002-7196-7994; Department of Electrical Engineering and Information Technology (DIETI), University of Naples “Federico II”, Naples, Italy; National Research Council of Italy (CNR), Institute for High Performance Computing and Networking (ICAR), Naples, Italy; National Research Council of Italy (CNR), Institute for High Performance Computing and Networking (ICAR), Naples, Italy; In the realm of Visual Question Answering, accurate answers often hinge on the harmonious fusion of textual and visual elements. While these complex architectures are effective, they typically come with a hefty price tag: a large number of parameters that demand significant processing power and lengthy training times. In contrast, traditional Dual-stream approaches prioritize accuracy above all else, neglecting the memory requirements of GPU processing and training time. This paper presents a novel Dual-stream architecture for VQA, whose parameters have been rigorously tested and evaluated not only for performance, but also for GPU memory consumption and training time. The results show that it’s possible to achieve competitive performance while significantly reducing the computational burden typically associated with complex VQA models.; https://ieeexplore.ieee.org/document/10811881/; Visual question answering (VQA); transformer models; natural language processing; dual-stream architecture; multimodal question answering; attention mechanisms |
title | Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering |
topic | Visual question answering (VQA); transformer models; natural language processing; dual-stream architecture; multimodal question answering; attention mechanisms |
url | https://ieeexplore.ieee.org/document/10811881/ |