InterAcT: A generic keypoints-based lightweight transformer model for recognition of human solo actions and interactions in aerial videos.
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2025-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0323314 |
| Summary: | Human action recognition forms an important part of several aerial security and surveillance applications, and numerous efforts have been made to solve the problem effectively and efficiently. Existing methods, however, are generally aimed at recognizing either solo actions or interactions, which restricts their use to specific scenarios. In addition, lightweight and computationally efficient models are still needed to make deployment in real-world applications practical. To this end, this paper presents a generic, lightweight, and computationally efficient Transformer-based model, referred to as InterAcT, that relies on bodily keypoints extracted with YOLO v8 to recognize both human solo actions and interactions in aerial videos. It features a lightweight architecture with 0.0709M parameters and 0.0389 GFLOPs, distinguishing it from the AcT models. An extensive performance evaluation has been carried out on two publicly available aerial datasets, Drone Action and UT-Interaction, comprising a total of 18 classes that include both solo actions and interactions. The model is optimized and trained on an 80% training set and a 10% validation set, and its performance is evaluated on the remaining 10% test set, achieving highly encouraging results on multiple benchmarks and outperforming several state-of-the-art methods. With an accuracy of 0.9923, our model outperforms the AcT variants (micro: 0.9353, small: 0.9893, base: 0.9907, large: 0.9558), 2P-GCN (0.9337), LSTM (0.9774), 3D-ResNet (0.9921), and 3D CNN (0.9920). It is able to recognize a large number of solo-action and two-person interaction classes in both aerial videos and footage from ground-level cameras (grayscale and RGB). |
| ISSN: | 1932-6203 |
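
For readers who want a concrete picture of the pipeline described in the summary, the sketch below shows a minimal keypoint-sequence Transformer classifier in PyTorch. It is an illustrative approximation only: the joint count, embedding size, layer count, and sequence length are assumptions, and it is not the authors' InterAcT architecture; only the 18-class output and the general idea of classifying YOLO v8 pose keypoints with a lightweight Transformer come from the abstract.

```python
# Illustrative sketch only: a minimal keypoint-sequence Transformer classifier.
# Dimensions, layer counts, and frame length are assumptions; the 18 classes
# follow the abstract. This is NOT the authors' InterAcT implementation.
import torch
import torch.nn as nn

class KeypointActionTransformer(nn.Module):
    def __init__(self, num_joints=17, coord_dim=2, d_model=64,
                 num_heads=4, num_layers=2, num_classes=18, max_frames=30):
        super().__init__()
        # Each frame's pose (num_joints x coord_dim) is flattened into one token.
        self.embed = nn.Linear(num_joints * coord_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_frames + 1, d_model))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=128,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, keypoints):
        # keypoints: (batch, frames, num_joints, coord_dim),
        # e.g. per-frame poses from a YOLO v8 pose model.
        b, t, j, c = keypoints.shape
        x = self.embed(keypoints.reshape(b, t, j * c))   # one token per frame
        cls = self.cls_token.expand(b, -1, -1)           # classification token
        x = torch.cat([cls, x], dim=1) + self.pos[:, : t + 1]
        x = self.encoder(x)
        return self.head(x[:, 0])                        # logits over 18 classes

# Example: a batch of 4 clips, 30 frames each, 17 keypoints per frame
logits = KeypointActionTransformer()(torch.randn(4, 30, 17, 2))
print(logits.shape)  # torch.Size([4, 18])
```

A model of roughly this shape stays in the sub-million-parameter range mentioned in the summary; handling two-person interactions would additionally require concatenating or interleaving both skeletons per frame, which the sketch does not show.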