Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models
Large-scale Visual-Language Models have demonstrated powerful adaptability in video recognition tasks. However, existing methods typically rely on fine-tuning or text prompt tuning. In this paper, we propose a visual-only prompting method that employs watermark and trademark prompts to bridge the distribution gap of spatial-temporal video data with Visual-Language Models. Our watermark prompts, designed by a trainable prompt generator, are customized for each video clip. Unlike conventional visual prompts that often exhibit noise signals, watermark prompts are intentionally designed to be imperceptible, ensuring they are not misinterpreted as an adversarial attack. The trademark prompts, bespoke for each video domain, establish the identity of specific video types. Integrating watermark prompts into video frames and prepending trademark prompts to per-frame embeddings significantly boosts the capability of the Visual-Language Model to understand video. Notably, our approach improves the adaptability of the CLIP model to various video action recognition datasets, achieving performance gains of 16.8%, 18.4%, and 13.8% on HMDB-51, UCF-101, and the egocentric dataset EPIC-Kitchen-100, respectively. Additionally, our visual-only prompting method demonstrates competitive performance compared with existing fine-tuning and adaptation methods while requiring fewer learnable parameters. Moreover, through extensive ablation studies, we find the optimal balance between imperceptibility and adaptability. Code will be made available.
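The two prompt types described in the abstract can be sketched as simple tensor operations: an imperceptible, clip-specific perturbation added to the pixels of every frame, and domain-specific tokens prepended to the sequence of per-frame embeddings. The sketch below is a minimal illustration, not the authors' implementation; the epsilon budget, shapes, and the use of a fixed array in place of the trainable prompt generator are all assumptions.

```python
import numpy as np

# Assumed imperceptibility budget for the watermark prompt
# (the paper tunes this trade-off via ablation; 4/255 is illustrative).
EPSILON = 4 / 255

def apply_watermark(frames, watermark):
    """Add an imperceptible, clip-specific watermark prompt to each frame.

    frames:    (T, H, W, C) pixel values in [0, 1]
    watermark: (H, W, C) output of a prompt generator (here: a fixed array)
    """
    # Clamp the perturbation so it stays visually imperceptible,
    # then keep the prompted frames in the valid pixel range.
    delta = np.clip(watermark, -EPSILON, EPSILON)
    return np.clip(frames + delta, 0.0, 1.0)

def prepend_trademark(frame_embeddings, trademark):
    """Prepend a domain-specific trademark prompt to per-frame embeddings.

    frame_embeddings: (T, D) per-frame embeddings (e.g., from CLIP)
    trademark:        (K, D) tokens identifying the video domain
    """
    return np.concatenate([trademark, frame_embeddings], axis=0)

# Toy usage with random stand-ins for a video clip and its embeddings.
rng = np.random.default_rng(0)
frames = rng.random((8, 224, 224, 3))
watermark = rng.normal(scale=0.01, size=(224, 224, 3))
prompted = apply_watermark(frames, watermark)

embeddings = rng.random((8, 512))
trademark = rng.random((2, 512))
sequence = prepend_trademark(embeddings, trademark)
print(prompted.shape, sequence.shape)  # → (8, 224, 224, 3) (10, 512)
```

Clamping the watermark to a small epsilon is what keeps it below the threshold of visibility, while the trademark tokens extend the embedding sequence the same way learnable prefix tokens do in prompt tuning.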
| Main Authors: | Longbin Jin, Hyuntaek Jung, Hyo Jin Jon, Eun Yi Kim |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Mathematics |
| Subjects: | visual prompt; video recognition; visual-language model |
| Online Access: | https://www.mdpi.com/2227-7390/13/9/1365 |
| _version_ | 1849312831089934336 |
|---|---|
| author | Longbin Jin; Hyuntaek Jung; Hyo Jin Jon; Eun Yi Kim |
| author_facet | Longbin Jin; Hyuntaek Jung; Hyo Jin Jon; Eun Yi Kim |
| author_sort | Longbin Jin |
| collection | DOAJ |
| description | Large-scale Visual-Language Models have demonstrated powerful adaptability in video recognition tasks. However, existing methods typically rely on fine-tuning or text prompt tuning. In this paper, we propose a visual-only prompting method that employs watermark and trademark prompts to bridge the distribution gap of spatial-temporal video data with Visual-Language Models. Our watermark prompts, designed by a trainable prompt generator, are customized for each video clip. Unlike conventional visual prompts that often exhibit noise signals, watermark prompts are intentionally designed to be imperceptible, ensuring they are not misinterpreted as an adversarial attack. The trademark prompts, bespoke for each video domain, establish the identity of specific video types. Integrating watermark prompts into video frames and prepending trademark prompts to per-frame embeddings significantly boosts the capability of the Visual-Language Model to understand video. Notably, our approach improves the adaptability of the CLIP model to various video action recognition datasets, achieving performance gains of 16.8%, 18.4%, and 13.8% on HMDB-51, UCF-101, and the egocentric dataset EPIC-Kitchen-100, respectively. Additionally, our visual-only prompting method demonstrates competitive performance compared with existing fine-tuning and adaptation methods while requiring fewer learnable parameters. Moreover, through extensive ablation studies, we find the optimal balance between imperceptibility and adaptability. Code will be made available. |
| format | Article |
| id | doaj-art-5dc30ba2e3694784b7e391a1599cf6bc |
| institution | Kabale University |
| issn | 2227-7390 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Mathematics |
| spelling | doaj-art-5dc30ba2e3694784b7e391a1599cf6bc; 2025-08-20T03:52:57Z; eng; MDPI AG; Mathematics; ISSN 2227-7390; 2025-04-01; vol. 13, no. 9, art. 1365; doi:10.3390/math13091365; Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models; Longbin Jin, Hyuntaek Jung, Hyo Jin Jon, Eun Yi Kim (Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea); https://www.mdpi.com/2227-7390/13/9/1365; visual prompt; video recognition; visual-language model |
| spellingShingle | Longbin Jin; Hyuntaek Jung; Hyo Jin Jon; Eun Yi Kim; Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models; Mathematics; visual prompt; video recognition; visual-language model |
| title | Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models |
| title_full | Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models |
| title_fullStr | Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models |
| title_full_unstemmed | Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models |
| title_short | Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models |
| title_sort | watermark and trademark prompts boost video action recognition in visual language models |
| topic | visual prompt; video recognition; visual-language model |
| url | https://www.mdpi.com/2227-7390/13/9/1365 |
| work_keys_str_mv | AT longbinjin watermarkandtrademarkpromptsboostvideoactionrecognitioninvisuallanguagemodels AT hyuntaekjung watermarkandtrademarkpromptsboostvideoactionrecognitioninvisuallanguagemodels AT hyojinjon watermarkandtrademarkpromptsboostvideoactionrecognitioninvisuallanguagemodels AT eunyikim watermarkandtrademarkpromptsboostvideoactionrecognitioninvisuallanguagemodels |