BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering
| Format: | Article |
|---|---|
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Online Access: | https://ieeexplore.ieee.org/document/10878995/ |
| Summary: | Visual Question Answering (VQA) is a challenging Artificial Intelligence (AI) problem that requires understanding both natural language and visual content in order to answer questions about images. Research on VQA has gained considerable traction because of its wide range of applications, including aiding visually impaired individuals, enhancing human-computer interaction, and facilitating content-based image retrieval. Although VQA has been studied extensively, most work has focused on English and has overlooked the complexity associated with low-resource languages, especially Bengali. To facilitate research in this area, we have developed a large-scale Bengali Visual Question Answering (BVQA) dataset by harnessing the in-context learning abilities of a Large Language Model (LLM). The BVQA dataset comprises around 17,800 diverse open-ended QA pairs generated from the human-annotated captions of ≈3,500 images. Replicating existing VQA systems for a low-resource language poses significant challenges because of their complex architectures and language-specific adaptations. To overcome this challenge, we propose the Multimodal CRoss-Attention Network (MCRAN), a novel framework that leverages pretrained transformer architectures to encode the visual and textual information. Furthermore, our method uses a multi-head attention mechanism to generate three distinct vision-language representations and fuses them using a sophisticated gating mechanism to answer the query regarding an image. Extensive experiments on the BVQA dataset show that the proposed method outperformed the existing baseline across various answer categories. The benchmark and source code are available at https://github.com/eftekhar-hossain/Bengali-VQA. |
|---|---|
| ISSN: | 2169-3536 |
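
The Summary mentions attention between vision and language features followed by gated fusion of three representations. The sketch below is only an illustration of that general idea and is not taken from the article or its repository. It assumes single-head scaled dot-product attention without learned projections (the article uses multi-head attention), and it assumes the three representations are text-attending-to-image, image-attending-to-text, and joint self-attention, which the Summary does not specify. All names (`cross_attention`, `gated_fusion`, `gate_weights`) and the toy feature values are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention (assumption: no learned
    # Q/K/V projections). Each query vector attends over keys_values.
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(len(q))])
    return out

def mean_pool(vectors):
    # Average a sequence of vectors into a single vector.
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def gated_fusion(reps, gate_weights):
    # Hypothetical gate: one scalar score per representation, softmax
    # normalised, then a weighted sum of the representations.
    gates = softmax([dot(w, r) for w, r in zip(gate_weights, reps)])
    dim = len(reps[0])
    fused = [sum(g * r[d] for g, r in zip(gates, reps)) for d in range(dim)]
    return fused, gates

# Toy stand-ins for encoder outputs (2 question tokens, 3 image patches).
text = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
image = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5], [1.0, 0.0, 1.0, 0.0]]

# Three assumed vision-language representations.
t2i = mean_pool(cross_attention(text, image))      # text attends to image
i2t = mean_pool(cross_attention(image, text))      # image attends to text
joint = mean_pool(cross_attention(text + image, text + image))

# Hypothetical fixed gate parameters; in a trained model these are learned.
gate_weights = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
fused, gates = gated_fusion([t2i, i2t, joint], gate_weights)
```

In a real system the fused vector would feed an answer classifier or decoder; the gate lets the model weight whichever representation is most informative for a given question.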