An effective video captioning based on language description using a novel Graylag Deep Kookaburra Reinforcement Learning

Abstract Video captioning presents a complex challenge, particularly due to the greater density of subjects within videos compared to image caption generation. The redundant visual information in video data adds complexity for captioners, making it difficult to distill the varied content and eliminate irrelevant elements. This redundancy also often leads to misalignment with the equivalent visual semantics in the ground truth, further complicating the video captioning process. In response to these challenges, this research introduces the Graylag Deep Kookaburra Reinforcement Learning (GDKRL) framework, which enhances video captioning through a multi-stage process. First, object detection is performed using the single-shot multi-box detector with generalized intersection over union for accurate object tracking and similarity calculation. Next, the gazelle autoencoder extracts and fuses features from video frames, integrating complex visual and temporal information into a unified representation. The residual convolved dual sparse graph attention network then generates detailed, contextually rich language descriptions by applying dual sparse attention mechanisms and residual convolutions. Finally, hybrid graylag and kookaburra optimization refines the captioning process, producing comprehensive and precise textual descriptions of the video content. Extensive experiments on MSVD yielded 81.79 BLEU-4, 51.2 METEOR, 133.3 CIDEr and 81.7 ROUGE-L; VATEX achieved 62.29 BLEU-4, 51.2 METEOR, 110.2 CIDEr and 78.45 ROUGE-L; and MSR-VTT obtained 44.52 BLEU-4, 33.35 METEOR, 63.9 CIDEr and 68.9 ROUGE-L, demonstrating that the proposed technique significantly outperforms previous approaches.

Bibliographic Details
Main Authors: M. Gowri Shankar, D. Surendran
Format: Article
Language:English
Published: SpringerOpen 2025-01-01
Series:EURASIP Journal on Image and Video Processing
Subjects: Video captioning; Similarity calculation; Language description; Policy and value network; Graph attention network
Online Access:https://doi.org/10.1186/s13640-024-00662-z
author M. Gowri Shankar
D. Surendran
collection DOAJ
description Abstract Video captioning presents a complex challenge, particularly due to the greater density of subjects within videos compared to image caption generation. The redundant visual information in video data adds complexity for captioners, making it difficult to distill the varied content and eliminate irrelevant elements. This redundancy also often leads to misalignment with the equivalent visual semantics in the ground truth, further complicating the video captioning process. In response to these challenges, this research introduces the Graylag Deep Kookaburra Reinforcement Learning (GDKRL) framework, which enhances video captioning through a multi-stage process. First, object detection is performed using the single-shot multi-box detector with generalized intersection over union for accurate object tracking and similarity calculation. Next, the gazelle autoencoder extracts and fuses features from video frames, integrating complex visual and temporal information into a unified representation. The residual convolved dual sparse graph attention network then generates detailed, contextually rich language descriptions by applying dual sparse attention mechanisms and residual convolutions. Finally, hybrid graylag and kookaburra optimization refines the captioning process, producing comprehensive and precise textual descriptions of the video content. Extensive experiments on MSVD yielded 81.79 BLEU-4, 51.2 METEOR, 133.3 CIDEr and 81.7 ROUGE-L; VATEX achieved 62.29 BLEU-4, 51.2 METEOR, 110.2 CIDEr and 78.45 ROUGE-L; and MSR-VTT obtained 44.52 BLEU-4, 33.35 METEOR, 63.9 CIDEr and 68.9 ROUGE-L, demonstrating that the proposed technique significantly outperforms previous approaches.
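The detection stage described in the abstract pairs a single-shot multi-box detector with generalized intersection over union (GIoU) for tracking and similarity calculation. As a rough illustration of that similarity measure only, the sketch below implements the standard GIoU formula for axis-aligned boxes in (x1, y1, x2, y2) format; the function name and box format are illustrative assumptions, not code from the paper:

```python
def giou(box_a, box_b):
    """Generalized IoU between two axis-aligned boxes (x1, y1, x2, y2).

    Returns a value in (-1, 1]: 1.0 for identical boxes, negative for
    well-separated boxes, unlike plain IoU which is 0 for any disjoint pair.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (clamped to zero when the boxes are disjoint)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest box enclosing both inputs
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    c_area = (cx2 - cx1) * (cy2 - cy1)

    # GIoU = IoU minus the fraction of the enclosing box not covered by the union
    return iou - (c_area - union) / c_area
```

Because GIoU stays informative even when boxes do not overlap, it gives a smoother similarity signal for associating detections across frames than plain IoU.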
format Article
id doaj-art-529dfca3221741aebc5655ad30e361e3
institution Kabale University
issn 1687-5281
language English
publishDate 2025-01-01
publisher SpringerOpen
record_format Article
series EURASIP Journal on Image and Video Processing
spelling M. Gowri Shankar (Computer Science and Engineering, Government College of Technology); D. Surendran (Information Technology, Karpagam College of Engineering)
title An effective video captioning based on language description using a novel Graylag Deep Kookaburra Reinforcement Learning
topic Video captioning
Similarity calculation
Language description
Policy and value network
Graph attention network
url https://doi.org/10.1186/s13640-024-00662-z