Dual modality prompt learning for visual question-grounded answering in robotic surgery

Abstract: With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery that is capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enable precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. The visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding the textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, ensuring high-quality generation of both textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
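
To illustrate the architecture the abstract describes, here is a minimal PyTorch sketch, not the authors' code: learnable prompt tokens for each modality are fused into that modality's feature sequence by a "complementary prompter", and the prompted features are combined over several fusion iterations to emit both a textual answer and a grounded bounding box. All module names, dimensions, head counts, and fusion mechanics are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ComplementaryPrompter(nn.Module):
    """Fuses learnable prompt tokens into one modality's feature sequence."""

    def __init__(self, dim: int, num_prompts: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable prompt tokens, shared across the batch (assumed design).
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim). Features attend over [prompts; features],
        # injecting prompt knowledge via a residual update.
        prompts = self.prompts.expand(feats.size(0), -1, -1)
        kv = torch.cat([prompts, feats], dim=1)
        out, _ = self.attn(feats, kv, kv)
        return self.norm(feats + out)


class DualModalityGroundedVQA(nn.Module):
    """Sketch of grounded VQA with visual and textual prompters."""

    def __init__(self, dim: int = 256, num_answers: int = 18, num_iters: int = 3):
        super().__init__()
        self.visual_prompter = ComplementaryPrompter(dim)
        self.textual_prompter = ComplementaryPrompter(dim)
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.num_iters = num_iters
        self.answer_head = nn.Linear(dim, num_answers)  # textual answer
        self.box_head = nn.Linear(dim, 4)               # grounded answer (x, y, w, h)

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor):
        # vis_feats: (B, Nv, dim) region features from a visual backbone.
        # txt_feats: (B, Nt, dim) token features from a text encoder.
        v = self.visual_prompter(vis_feats)
        t = self.textual_prompter(txt_feats)
        # "Multiple iterative fusion": text repeatedly attends to the
        # prompted visual features to refine answer reasoning.
        for _ in range(self.num_iters):
            fused, _ = self.cross(t, v, v)
            t = t + fused
        answer_logits = self.answer_head(t.mean(dim=1))
        box = self.box_head(v.mean(dim=1)).sigmoid()  # normalized coordinates
        return answer_logits, box


# Example usage with dummy features:
model = DualModalityGroundedVQA()
logits, box = model(torch.randn(2, 49, 256), torch.randn(2, 12, 256))
```

The sketch keeps the two prompters symmetric; the paper's actual prompters likely differ in how each modality's prompt interacts with the other modality's features.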

Bibliographic Details
Main Authors: Yue Zhang, Wanshu Fan, Peixi Peng, Xin Yang, Dongsheng Zhou, Xiaopeng Wei
Format: Article
Language: English
Published: SpringerOpen, 2024-04-01
Series: Visual Computing for Industry, Biomedicine, and Art
ISSN: 2524-4442
Subjects: Prompt learning; Visual prompt; Textual prompt; Grounding-answering; Visual question answering
Online Access: https://doi.org/10.1186/s42492-024-00160-z
Author Affiliations:
Yue Zhang, Wanshu Fan, Peixi Peng, Dongsheng Zhou: National and Local Joint Engineering Laboratory of Computer Aided Design, School of Software Engineering, Dalian University
Xin Yang, Xiaopeng Wei: School of Computer Science and Technology, Dalian University of Technology
Citation: Visual Computing for Industry, Biomedicine, and Art, vol. 7, no. 1, pp. 1-13, 2024-04-01. https://doi.org/10.1186/s42492-024-00160-z