MFEAM: Multi-View Feature Enhanced Attention Model for Image Captioning

Bibliographic Details
Main Authors: Yang Cui, Juan Zhang
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/15/8368
Description
Summary: Image captioning plays a crucial role in aligning visual content with natural language, serving as a key step toward effective cross-modal understanding. The Transformer has become the dominant language model in image captioning, yet existing Transformer-based models seldom highlight important features from multiple views when applying self-attention. In this paper, we propose MFEAM, a novel network built on multi-view feature-enhanced attention. To accurately represent the entangled features of vision and text, the teacher model employs multi-view feature-enhanced attention and guides the student model's training through knowledge distillation and model averaging from both the visual and textual views. To mitigate the impact of excessive feature enhancement, the student model divides its decoding layers into two groups, which separately process instance features and the relationships between instances. Experimental results demonstrate that MFEAM attains competitive performance on the MSCOCO (Microsoft Common Objects in Context) dataset when trained without external data.
ISSN: 2076-3417
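
The abstract describes a teacher-student scheme: a teacher with multi-view feature-enhanced attention guides a student via knowledge distillation and model averaging, and the student's decoder layers are split into one group for instance features and one for inter-instance relations. The sketch below is a minimal, hypothetical PyTorch illustration of those three ingredients only; the class and function names, layer counts, and hyperparameters are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFEAMStudentDecoder(nn.Module):
    """Toy decoder whose layers are split into two groups, as the abstract
    describes: the first group attends to instance features, the second to
    relationships between instances. (Names and sizes are illustrative.)"""
    def __init__(self, d_model=512, n_head=8, n_layers=4):
        super().__init__()
        make = lambda: nn.TransformerDecoderLayer(d_model, n_head, batch_first=True)
        half = n_layers // 2
        self.instance_group = nn.ModuleList(make() for _ in range(half))
        self.relation_group = nn.ModuleList(make() for _ in range(n_layers - half))

    def forward(self, tokens, instance_feats, relation_feats):
        # tokens: (B, T, d_model) caption embeddings;
        # instance_feats / relation_feats: (B, N, d_model) visual memories.
        x = tokens
        for layer in self.instance_group:   # group 1: per-instance features
            x = layer(x, instance_feats)
        for layer in self.relation_group:   # group 2: inter-instance relations
            x = layer(x, relation_feats)
        return x

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Standard temperature-scaled KL distillation; the paper may weight or
    combine its loss terms differently."""
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """One reading of 'model averaging': teacher weights track an exponential
    moving average of the student's weights after each training step."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)
```

In this reading, each training step would compute the distillation loss between student and teacher caption logits, back-propagate through the student only, and then call `ema_update` so the teacher slowly averages the student's weights.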