Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models

Large-scale Visual-Language Models have demonstrated powerful adaptability in video recognition tasks. However, existing methods typically rely on fine-tuning or text prompt tuning. In this paper, we propose a visual-only prompting method that employs watermark and trademark prompts to bridge the distribution gap of spatial-temporal video data with Visual-Language Models. Our watermark prompts, designed by a trainable prompt generator, are customized for each video clip. Unlike conventional visual prompts that often exhibit noise signals, watermark prompts are intentionally designed to be imperceptible, ensuring they are not misinterpreted as an adversarial attack. The trademark prompts, bespoke for each video domain, establish the identity of specific video types. Integrating watermark prompts into video frames and prepending trademark prompts to per-frame embeddings significantly boosts the capability of the Visual-Language Model to understand video. Notably, our approach improves the adaptability of the CLIP model to various video action recognition datasets, achieving performance gains of 16.8%, 18.4%, and 13.8% on HMDB-51, UCF-101, and the egocentric dataset EPIC-Kitchen-100, respectively. Additionally, our visual-only prompting method demonstrates competitive performance compared with existing fine-tuning and adaptation methods while requiring fewer learnable parameters. Moreover, through extensive ablation studies, we find the optimal balance between imperceptibility and adaptability. Code will be made available.

Bibliographic Details
Main Authors: Longbin Jin, Hyuntaek Jung, Hyo Jin Jon, Eun Yi Kim
Format: Article
Language: English
Published: MDPI AG, 2025-04-01
Series: Mathematics
Subjects: visual prompt; video recognition; visual-language model
Online Access: https://www.mdpi.com/2227-7390/13/9/1365
collection DOAJ
description Large-scale Visual-Language Models have demonstrated powerful adaptability in video recognition tasks. However, existing methods typically rely on fine-tuning or text prompt tuning. In this paper, we propose a visual-only prompting method that employs watermark and trademark prompts to bridge the distribution gap of spatial-temporal video data with Visual-Language Models. Our watermark prompts, designed by a trainable prompt generator, are customized for each video clip. Unlike conventional visual prompts that often exhibit noise signals, watermark prompts are intentionally designed to be imperceptible, ensuring they are not misinterpreted as an adversarial attack. The trademark prompts, bespoke for each video domain, establish the identity of specific video types. Integrating watermark prompts into video frames and prepending trademark prompts to per-frame embeddings significantly boosts the capability of the Visual-Language Model to understand video. Notably, our approach improves the adaptability of the CLIP model to various video action recognition datasets, achieving performance gains of 16.8%, 18.4%, and 13.8% on HMDB-51, UCF-101, and the egocentric dataset EPIC-Kitchen-100, respectively. Additionally, our visual-only prompting method demonstrates competitive performance compared with existing fine-tuning and adaptation methods while requiring fewer learnable parameters. Moreover, through extensive ablation studies, we find the optimal balance between imperceptibility and adaptability. Code will be made available.
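The description above outlines two mechanisms: an imperceptible, per-clip watermark prompt added to the pixels of each frame, and learnable trademark tokens prepended to the per-frame embeddings before they enter the Visual-Language Model. The following is a minimal, hypothetical sketch of that idea; the class names, the perturbation bound `epsilon`, the token count, and the 512-dimensional embedding size are all illustrative assumptions, not details of the authors' implementation.

```python
# Hedged sketch of watermark + trademark prompting, assuming a CLIP-style
# frame encoder. All names and hyperparameters here are assumptions.
import torch
import torch.nn as nn

class WatermarkGenerator(nn.Module):
    """Trainable generator of a per-clip additive prompt; the small
    epsilon bound keeps the perturbation visually imperceptible."""
    def __init__(self, channels=3, hidden=16, epsilon=4 / 255):
        super().__init__()
        self.epsilon = epsilon
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, frames):  # frames: (B*T, C, H, W), values in [0, 1]
        delta = torch.tanh(self.net(frames)) * self.epsilon  # bounded prompt
        return (frames + delta).clamp(0.0, 1.0)

class TrademarkPrompt(nn.Module):
    """Learnable domain-identity tokens prepended to per-frame embeddings."""
    def __init__(self, num_tokens=4, embed_dim=512):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, frame_embeds):  # frame_embeds: (B, T, D)
        b = frame_embeds.size(0)
        tm = self.tokens.unsqueeze(0).expand(b, -1, -1)  # (B, num_tokens, D)
        return torch.cat([tm, frame_embeds], dim=1)  # (B, num_tokens + T, D)

# Toy usage: 2 clips of 8 RGB frames at 32x32, with stand-in 512-d features.
wm = WatermarkGenerator()
tm = TrademarkPrompt(num_tokens=4, embed_dim=512)
frames = torch.rand(2 * 8, 3, 32, 32)
prompted = wm(frames)                  # watermarked frames, still in [0, 1]
embeds = torch.randn(2, 8, 512)        # stand-in for CLIP frame embeddings
out = tm(embeds)                       # (2, 12, 512) after prepending tokens
```

Only the generator and the trademark tokens would be trained, which is consistent with the paper's claim of needing fewer learnable parameters than fine-tuning; the frozen backbone itself is not touched in this sketch.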
format Article
id doaj-art-5dc30ba2e3694784b7e391a1599cf6bc
institution Kabale University
issn 2227-7390
language English
publishDate 2025-04-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj-art-5dc30ba2e3694784b7e391a1599cf6bc (indexed 2025-08-20T03:52:57Z)
doi 10.3390/math13091365 (Mathematics, vol. 13, no. 9, art. 1365, 2025-04-01, MDPI AG)
affiliation Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea (all four authors)
topic visual prompt
video recognition
visual-language model
url https://www.mdpi.com/2227-7390/13/9/1365