Dual modality prompt learning for visual question-grounded answering in robotic surgery
Abstract With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of the VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, to ensure high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
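The abstract describes the architecture only at a high level. As a rough illustration of the dual-modality prompting idea, the sketch below shows one possible arrangement: a visual complementary prompter that mixes learnable visual prompt tokens into image features, a textual complementary prompter that conditions question features on the prompted visual features, an iterative fusion loop, and separate heads for the textual answer and the grounded bounding box. This is a minimal PyTorch sketch; all class names, dimensions, and the attention-based prompter design are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's released code): a minimal dual-modality
# prompt model that returns both an answer distribution and a bounding box.
import torch
import torch.nn as nn


class ComplementaryPrompter(nn.Module):
    """Mixes learnable prompt tokens into one modality, conditioned on the other."""

    def __init__(self, dim: int, num_prompts: int = 8, num_heads: int = 4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Prepend prompt tokens, then let the sequence attend to the other modality.
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        seq = torch.cat([prompts, tokens], dim=1)
        attended, _ = self.cross_attn(seq, context, context)
        # Residual + norm, then drop the prompt slots so the output matches `tokens`.
        return self.norm(seq + attended)[:, prompts.size(1):]


class DualPromptGroundedVQA(nn.Module):
    def __init__(self, dim: int = 256, num_answers: int = 18, num_iters: int = 2):
        super().__init__()
        self.visual_prompter = ComplementaryPrompter(dim)
        self.textual_prompter = ComplementaryPrompter(dim)
        self.num_iters = num_iters
        self.answer_head = nn.Linear(dim, num_answers)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, visual_feats: torch.Tensor, text_feats: torch.Tensor):
        v, t = visual_feats, text_feats
        # Multiple iterative fusion: each pass refines one modality with the other.
        for _ in range(self.num_iters):
            v = self.visual_prompter(v, t)   # visual features guided toward the region
            t = self.textual_prompter(t, v)  # textual features guided toward the answer
        answer_logits = self.answer_head(t.mean(dim=1))
        box = self.box_head(v.mean(dim=1)).sigmoid()  # normalized (cx, cy, w, h)
        return answer_logits, box


if __name__ == "__main__":
    model = DualPromptGroundedVQA()
    vis = torch.randn(2, 49, 256)   # e.g., 7x7 patch features from a surgical frame
    txt = torch.randn(2, 20, 256)   # embedded question tokens
    logits, box = model(vis, txt)
    print(logits.shape, box.shape)  # torch.Size([2, 18]) torch.Size([2, 4])
```

The prompt length and the number of fusion iterations are the main knobs in such a design; both are fixed here purely to keep the sketch compact.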
| Main Authors: | Yue Zhang, Wanshu Fan, Peixi Peng, Xin Yang, Dongsheng Zhou, Xiaopeng Wei |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen, 2024-04-01 |
| Series: | Visual Computing for Industry, Biomedicine, and Art |
| Subjects: | Prompt learning; Visual prompt; Textual prompt; Grounding-answering; Visual question answering |
| Online Access: | https://doi.org/10.1186/s42492-024-00160-z |
| _version_ | 1846165571120398336 |
|---|---|
| author | Yue Zhang; Wanshu Fan; Peixi Peng; Xin Yang; Dongsheng Zhou; Xiaopeng Wei |
| collection | DOAJ |
| description | Abstract With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of the VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, to ensure high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets. |
| format | Article |
| id | doaj-art-aacb44afa9224e1da20c704418d07e75 |
| institution | Kabale University |
| issn | 2524-4442 |
| language | English |
| publishDate | 2024-04-01 |
| publisher | SpringerOpen |
| record_format | Article |
| series | Visual Computing for Industry, Biomedicine, and Art |
| author affiliations | Yue Zhang, Wanshu Fan, Peixi Peng, Dongsheng Zhou: National and Local Joint Engineering Laboratory of Computer Aided Design, School of Software Engineering, Dalian University; Xin Yang, Xiaopeng Wei: School of Computer Science and Technology, Dalian University of Technology |
| title | Dual modality prompt learning for visual question-grounded answering in robotic surgery |
| topic | Prompt learning; Visual prompt; Textual prompt; Grounding-answering; Visual question answering |
| url | https://doi.org/10.1186/s42492-024-00160-z |