An effective video captioning based on language description using a novel Graylag Deep Kookaburra Reinforcement Learning

Abstract Video captioning presents a complex challenge, particularly due to the greater density of subjects within videos compared to image caption generation. The redundant visual information in video data adds complexity for captioners, making it difficult to distill the varied content and eliminate irrelevant elements. This redundancy also often leads to misalignment with the equivalent visual semantics in the ground truth, further complicating the video captioning process. In response to these challenges, this research introduces the Graylag Deep Kookaburra Reinforcement Learning (GDKRL) framework, which enhances video captioning through a multi-stage process. First, object detection is performed using the single-shot multi-box detector with generalized intersection over union for accurate object tracking and similarity calculation. Next, the gazelle autoencoder extracts and fuses features from video frames, integrating complex visual and temporal information into a unified representation. The residual convolved dual sparse graph attention network then generates detailed, contextually rich language descriptions by applying dual sparse attention mechanisms and residual convolutions. Finally, hybrid graylag and kookaburra optimization refines the captioning process, producing comprehensive and precise textual descriptions of the video content. Extensive experiments on MSVD yielded 81.79 BLEU-4, 51.2 METEOR, 133.3 CIDEr and 81.7 ROUGE-L; VATEX achieved 62.29 BLEU-4, 51.2 METEOR, 110.2 CIDEr and 78.45 ROUGE-L; and MSR-VTT obtained 44.52 BLEU-4, 33.35 METEOR, 63.9 CIDEr and 68.9 ROUGE-L, demonstrating that the proposed technique significantly outperforms previous approaches.

Bibliographic Details
Main Authors: M. Gowri Shankar, D. Surendran
Format: Article
Language:English
Published: SpringerOpen 2025-01-01
Series:EURASIP Journal on Image and Video Processing
Subjects: Video captioning; Similarity calculation; Language description; Policy and value network; Graph attention network
Online Access:https://doi.org/10.1186/s13640-024-00662-z
author M. Gowri Shankar
D. Surendran
collection DOAJ
description Abstract Video captioning presents a complex challenge, particularly due to the greater density of subjects within videos compared to image caption generation. The redundant visual information in video data adds complexity for captioners, making it difficult to distill the varied content and eliminate irrelevant elements. This redundancy also often leads to misalignment with the equivalent visual semantics in the ground truth, further complicating the video captioning process. In response to these challenges, this research introduces the Graylag Deep Kookaburra Reinforcement Learning (GDKRL) framework, which enhances video captioning through a multi-stage process. First, object detection is performed using the single-shot multi-box detector with generalized intersection over union for accurate object tracking and similarity calculation. Next, the gazelle autoencoder extracts and fuses features from video frames, integrating complex visual and temporal information into a unified representation. The residual convolved dual sparse graph attention network then generates detailed, contextually rich language descriptions by applying dual sparse attention mechanisms and residual convolutions. Finally, hybrid graylag and kookaburra optimization refines the captioning process, producing comprehensive and precise textual descriptions of the video content. Extensive experiments on MSVD yielded 81.79 BLEU-4, 51.2 METEOR, 133.3 CIDEr and 81.7 ROUGE-L; VATEX achieved 62.29 BLEU-4, 51.2 METEOR, 110.2 CIDEr and 78.45 ROUGE-L; and MSR-VTT obtained 44.52 BLEU-4, 33.35 METEOR, 63.9 CIDEr and 68.9 ROUGE-L, demonstrating that the proposed technique significantly outperforms previous approaches.
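The detection stage described in the abstract pairs a single-shot multi-box detector with generalized intersection over union (GIoU) for tracking and similarity calculation. As a rough illustration of that similarity measure only, the sketch below implements the standard GIoU formula for axis-aligned boxes in (x1, y1, x2, y2) format; the function name and box format are illustrative assumptions, not code from the paper:

```python
def giou(box_a, box_b):
    """Generalized IoU between two axis-aligned boxes (x1, y1, x2, y2).

    Returns a value in (-1, 1]: 1.0 for identical boxes, negative for
    well-separated boxes, unlike plain IoU which is 0 for any disjoint pair.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (clamped to zero when the boxes are disjoint)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest box enclosing both inputs
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    c_area = (cx2 - cx1) * (cy2 - cy1)

    # GIoU = IoU minus the fraction of the enclosing box not covered by the union
    return iou - (c_area - union) / c_area
```

Because GIoU stays informative even when boxes do not overlap, it gives a smoother similarity signal for associating detections across frames than plain IoU.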
format Article
id doaj-art-529dfca3221741aebc5655ad30e361e3
institution Kabale University
issn 1687-5281
language English
publishDate 2025-01-01
publisher SpringerOpen
record_format Article
series EURASIP Journal on Image and Video Processing
spelling M. Gowri Shankar (Computer Science and Engineering, Government College of Technology); D. Surendran (Information Technology, Karpagam College of Engineering)
title An effective video captioning based on language description using a novel Graylag Deep Kookaburra Reinforcement Learning
topic Video captioning
Similarity calculation
Language description
Policy and value network
Graph attention network
url https://doi.org/10.1186/s13640-024-00662-z