BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering


Bibliographic Details
Main Authors: Md. Shalha Mucha Bhuyan, Eftekhar Hossain, Khaleda Akhter Sathi, Md. Azad Hossain, M. Ali Akber Dewan
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10878995/
Description
Summary: Visual Question Answering (VQA) is a challenging Artificial Intelligence (AI) problem that requires an understanding of natural language and computer vision to answer questions about the visual content of images. Research on VQA has gained immense traction due to its wide range of applications, such as aiding visually impaired individuals, enhancing human-computer interaction, and facilitating content-based image retrieval systems. While there has been extensive research on VQA, most of it has focused on English, often overlooking the complexities of low-resource languages, especially Bengali. To facilitate research in this area, we have developed a large-scale Bengali Visual Question Answering (BVQA) dataset by harnessing the in-context learning abilities of a Large Language Model (LLM). Our BVQA dataset encompasses around 17,800 diverse open-ended QA pairs generated from the human-annotated captions of ≈3,500 images. Replicating existing VQA systems for a low-resource language poses significant challenges due to the complexity of their architectures and their adaptation to particular languages. To overcome this challenge, we propose the Multimodal CRoss-Attention Network (MCRAN), a novel framework that leverages pretrained transformer architectures to encode visual and textual information. Furthermore, our method uses a multi-head attention mechanism to generate three distinct vision-language representations and fuses them with a gating mechanism to answer the query about an image. Extensive experiments on the BVQA dataset show that the proposed method outperforms the existing baseline across various answer categories. The benchmark and source code are available at https://github.com/eftekhar-hossain/Bengali-VQA.
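
The abstract describes the MCRAN fusion only at a high level (pretrained image and question encoders, three multi-head attention views, and a gating mechanism). The following minimal PyTorch sketch illustrates that general idea; the hidden size, head count, answer-vocabulary size, mean pooling, and the particular choice of the three attention views are illustrative assumptions rather than the authors' released implementation, which is available in the linked repository.

```python
# Minimal sketch (assumptions noted above), not the authors' released code:
# text-to-image and image-to-text cross-attention plus joint self-attention
# give three vision-language representations, mixed by a learned gate.
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8, num_answers=1000):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.joint = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gating network: softmax weights over the three pooled representations.
        self.gate = nn.Linear(3 * dim, 3)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, dim) from a pretrained vision transformer
        # txt_feats: (B, N_txt, dim) from a pretrained language model
        t2i, _ = self.txt2img(txt_feats, img_feats, img_feats)  # text queries image
        i2t, _ = self.img2txt(img_feats, txt_feats, txt_feats)  # image queries text
        joint_seq = torch.cat([img_feats, txt_feats], dim=1)
        jnt, _ = self.joint(joint_seq, joint_seq, joint_seq)    # joint self-attention

        # Mean-pool each view into one vector per example.
        r1, r2, r3 = t2i.mean(dim=1), i2t.mean(dim=1), jnt.mean(dim=1)

        # Gate: convex combination of the three vision-language representations.
        weights = torch.softmax(self.gate(torch.cat([r1, r2, r3], dim=-1)), dim=-1)
        fused = weights[:, 0:1] * r1 + weights[:, 1:2] * r2 + weights[:, 2:3] * r3
        return self.classifier(fused)  # logits over the answer vocabulary
```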
ISSN: 2169-3536