Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models

Large-scale Visual-Language Models have demonstrated powerful adaptability in video recognition tasks. However, existing methods typically rely on fine-tuning or text prompt tuning. In this paper, we propose a visual-only prompting method that employs watermark and trademark prompts to bridge the distribution gap of spatial-temporal video data with Visual-Language Models. Our watermark prompts, designed by a trainable prompt generator, are customized for each video clip. Unlike conventional visual prompts that often exhibit noise signals, watermark prompts are intentionally designed to be imperceptible, ensuring they are not misinterpreted as an adversarial attack. The trademark prompts, bespoke for each video domain, establish the identity of specific video types. Integrating watermark prompts into video frames and prepending trademark prompts to per-frame embeddings significantly boosts the capability of the Visual-Language Model to understand video. Notably, our approach improves the adaptability of the CLIP model to various video action recognition datasets, achieving performance gains of 16.8%, 18.4%, and 13.8% on HMDB-51, UCF-101, and the egocentric dataset EPIC-Kitchen-100, respectively. Additionally, our visual-only prompting method demonstrates competitive performance compared with existing fine-tuning and adaptation methods while requiring fewer learnable parameters. Moreover, through extensive ablation studies, we find the optimal balance between imperceptibility and adaptability. Code will be made available.

Bibliographic Details
Main Authors: Longbin Jin, Hyuntaek Jung, Hyo Jin Jon, Eun Yi Kim
Format: Article
Language: English
Published: MDPI AG, 2025-04-01
Series: Mathematics
Subjects: visual prompt; video recognition; visual-language model
Online Access: https://www.mdpi.com/2227-7390/13/9/1365
collection DOAJ
description Large-scale Visual-Language Models have demonstrated powerful adaptability in video recognition tasks. However, existing methods typically rely on fine-tuning or text prompt tuning. In this paper, we propose a visual-only prompting method that employs watermark and trademark prompts to bridge the distribution gap of spatial-temporal video data with Visual-Language Models. Our watermark prompts, designed by a trainable prompt generator, are customized for each video clip. Unlike conventional visual prompts that often exhibit noise signals, watermark prompts are intentionally designed to be imperceptible, ensuring they are not misinterpreted as an adversarial attack. The trademark prompts, bespoke for each video domain, establish the identity of specific video types. Integrating watermark prompts into video frames and prepending trademark prompts to per-frame embeddings significantly boosts the capability of the Visual-Language Model to understand video. Notably, our approach improves the adaptability of the CLIP model to various video action recognition datasets, achieving performance gains of 16.8%, 18.4%, and 13.8% on HMDB-51, UCF-101, and the egocentric dataset EPIC-Kitchen-100, respectively. Additionally, our visual-only prompting method demonstrates competitive performance compared with existing fine-tuning and adaptation methods while requiring fewer learnable parameters. Moreover, through extensive ablation studies, we find the optimal balance between imperceptibility and adaptability. Code will be made available.
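The description above outlines two mechanisms: an imperceptible, per-clip watermark prompt added to the pixels of each frame, and learnable trademark tokens prepended to the per-frame embeddings before they enter the Visual-Language Model. The following is a minimal, hypothetical sketch of that idea; the class names, the perturbation bound `epsilon`, the token count, and the 512-dimensional embedding size are all illustrative assumptions, not details of the authors' implementation.

```python
# Hedged sketch of watermark + trademark prompting, assuming a CLIP-style
# frame encoder. All names and hyperparameters here are assumptions.
import torch
import torch.nn as nn

class WatermarkGenerator(nn.Module):
    """Trainable generator of a per-clip additive prompt; the small
    epsilon bound keeps the perturbation visually imperceptible."""
    def __init__(self, channels=3, hidden=16, epsilon=4 / 255):
        super().__init__()
        self.epsilon = epsilon
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, frames):  # frames: (B*T, C, H, W), values in [0, 1]
        delta = torch.tanh(self.net(frames)) * self.epsilon  # bounded prompt
        return (frames + delta).clamp(0.0, 1.0)

class TrademarkPrompt(nn.Module):
    """Learnable domain-identity tokens prepended to per-frame embeddings."""
    def __init__(self, num_tokens=4, embed_dim=512):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, frame_embeds):  # frame_embeds: (B, T, D)
        b = frame_embeds.size(0)
        tm = self.tokens.unsqueeze(0).expand(b, -1, -1)  # (B, num_tokens, D)
        return torch.cat([tm, frame_embeds], dim=1)  # (B, num_tokens + T, D)

# Toy usage: 2 clips of 8 RGB frames at 32x32, with stand-in 512-d features.
wm = WatermarkGenerator()
tm = TrademarkPrompt(num_tokens=4, embed_dim=512)
frames = torch.rand(2 * 8, 3, 32, 32)
prompted = wm(frames)                  # watermarked frames, still in [0, 1]
embeds = torch.randn(2, 8, 512)        # stand-in for CLIP frame embeddings
out = tm(embeds)                       # (2, 12, 512) after prepending tokens
```

Only the generator and the trademark tokens would be trained, which is consistent with the paper's claim of needing fewer learnable parameters than fine-tuning; the frozen backbone itself is not touched in this sketch.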
format Article
id doaj-art-5dc30ba2e3694784b7e391a1599cf6bc
institution Kabale University
issn 2227-7390
language English
publishDate 2025-04-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj-art-5dc30ba2e3694784b7e391a1599cf6bc (indexed 2025-08-20T03:52:57Z)
doi 10.3390/math13091365 (Mathematics, vol. 13, no. 9, art. 1365, 2025-04-01, MDPI AG)
affiliation Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea (all four authors)
topic visual prompt
video recognition
visual-language model
url https://www.mdpi.com/2227-7390/13/9/1365