Dual modality prompt learning for visual question-grounded answering in robotic surgery

Abstract: With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery that is capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enable precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. The visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding the textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, ensuring high-quality generation of both textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
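
To illustrate the architecture the abstract describes, here is a minimal PyTorch sketch, not the authors' code: learnable prompt tokens for each modality are fused into that modality's feature sequence by a "complementary prompter", and the prompted features are combined over several fusion iterations to emit both a textual answer and a grounded bounding box. All module names, dimensions, head counts, and fusion mechanics are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ComplementaryPrompter(nn.Module):
    """Fuses learnable prompt tokens into one modality's feature sequence."""

    def __init__(self, dim: int, num_prompts: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable prompt tokens, shared across the batch (assumed design).
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim). Features attend over [prompts; features],
        # injecting prompt knowledge via a residual update.
        prompts = self.prompts.expand(feats.size(0), -1, -1)
        kv = torch.cat([prompts, feats], dim=1)
        out, _ = self.attn(feats, kv, kv)
        return self.norm(feats + out)


class DualModalityGroundedVQA(nn.Module):
    """Sketch of grounded VQA with visual and textual prompters."""

    def __init__(self, dim: int = 256, num_answers: int = 18, num_iters: int = 3):
        super().__init__()
        self.visual_prompter = ComplementaryPrompter(dim)
        self.textual_prompter = ComplementaryPrompter(dim)
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.num_iters = num_iters
        self.answer_head = nn.Linear(dim, num_answers)  # textual answer
        self.box_head = nn.Linear(dim, 4)               # grounded answer (x, y, w, h)

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor):
        # vis_feats: (B, Nv, dim) region features from a visual backbone.
        # txt_feats: (B, Nt, dim) token features from a text encoder.
        v = self.visual_prompter(vis_feats)
        t = self.textual_prompter(txt_feats)
        # "Multiple iterative fusion": text repeatedly attends to the
        # prompted visual features to refine answer reasoning.
        for _ in range(self.num_iters):
            fused, _ = self.cross(t, v, v)
            t = t + fused
        answer_logits = self.answer_head(t.mean(dim=1))
        box = self.box_head(v.mean(dim=1)).sigmoid()  # normalized coordinates
        return answer_logits, box


# Example usage with dummy features:
model = DualModalityGroundedVQA()
logits, box = model(torch.randn(2, 49, 256), torch.randn(2, 12, 256))
```

The sketch keeps the two prompters symmetric; the paper's actual prompters likely differ in how each modality's prompt interacts with the other modality's features.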

Bibliographic Details
Main Authors: Yue Zhang, Wanshu Fan, Peixi Peng, Xin Yang, Dongsheng Zhou, Xiaopeng Wei
Format: Article
Language: English
Published: SpringerOpen, 2024-04-01
Series: Visual Computing for Industry, Biomedicine, and Art
ISSN: 2524-4442
Subjects: Prompt learning; Visual prompt; Textual prompt; Grounding-answering; Visual question answering
Online Access: https://doi.org/10.1186/s42492-024-00160-z
Author Affiliations:
Yue Zhang, Wanshu Fan, Peixi Peng, Dongsheng Zhou: National and Local Joint Engineering Laboratory of Computer Aided Design, School of Software Engineering, Dalian University
Xin Yang, Xiaopeng Wei: School of Computer Science and Technology, Dalian University of Technology
Citation: Visual Computing for Industry, Biomedicine, and Art, vol. 7, no. 1, pp. 1-13, 2024-04-01. https://doi.org/10.1186/s42492-024-00160-z