Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition
Few-shot action recognition aims to train a model to classify actions in videos using only a few examples, known as “shots,” per action class. This learning approach is particularly useful but challenging due to the limited availability of labeled video data in practice.
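As a rough, hedged illustration of the few-shot classification setting the abstract describes, the sketch below classifies query video clips in an N-way K-shot episode by nearest class prototype over precomputed clip embeddings. A prototypical-network-style classifier is one common meta-learning choice, used here only as an assumption for illustration; the paper does not name its paradigms at this point, and the embedding dimension, class count, and function names are invented for the example.

```python
# Minimal sketch (assumption, not the paper's method): nearest-prototype
# classification of query clips in an N-way K-shot episode, operating on
# precomputed video embeddings.
import numpy as np

def prototype_classify(support, support_labels, query, n_way):
    """support: (n_way * k_shot, d) embeddings of labeled support clips.
    support_labels: (n_way * k_shot,) integer class ids in [0, n_way).
    query: (n_query, d) embeddings of clips to classify.
    Returns predicted class ids for the query clips."""
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in range(n_way)]
    )
    # Assign each query clip to the nearest prototype (squared Euclidean distance).
    dists = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy 5-way 3-shot episode with random 256-d "video features" (illustrative values).
rng = np.random.default_rng(0)
d, n_way, k_shot, n_query = 256, 5, 3, 10
support = rng.normal(size=(n_way * k_shot, d))
labels = np.repeat(np.arange(n_way), k_shot)
query = rng.normal(size=(n_query, d))
print(prototype_classify(support, labels, query, n_way))
```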
| Main Authors: | Nguyen Anh Tu, Nartay Aikyn, Nursultan Makhanov, Assanali Abu, Kok-Seng Wong, Min-Ho Lee |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | Human action recognition; few-shot learning; federated learning; representation learning; few-shot action recognition |
| Online Access: | https://ieeexplore.ieee.org/document/10804801/ |
| _version_ | 1846107094869082112 |
|---|---|
| author | Nguyen Anh Tu, Nartay Aikyn, Nursultan Makhanov, Assanali Abu, Kok-Seng Wong, Min-Ho Lee |
    
| author_facet | Nguyen Anh Tu, Nartay Aikyn, Nursultan Makhanov, Assanali Abu, Kok-Seng Wong, Min-Ho Lee |
    
| author_sort | Nguyen Anh Tu | 
    
| collection | DOAJ | 
    
| description | Few-shot action recognition aims to train a model to classify actions in videos using only a few examples, known as “shots,” per action class. This learning approach is particularly useful but challenging due to the limited availability of labeled video data in practice. Although significant progress has been made in developing few-shot learners, existing methods still face several limitations. Firstly, current methods have not sufficiently explored the effectiveness of 3D feature extractors (e.g., 3D CNNs or Video Transformers), thereby failing to exploit spatiotemporal dynamics in videos. Secondly, the need for a large video dataset to train the model in a centralized manner raises privacy concerns and results in high storage costs and communication overheads. Thirdly, the existing solutions based on local deployment lack the capability to benefit global prior knowledge from a wide variety of real-world action samples. To address these limitations, we propose a federated learning (FL) framework named FedFSLAR++ to collaboratively train few-shot learners with 3D feature extractors. Specifically, we perform few-shot action recognition tasks under FL settings, enhancing privacy protection while maintaining efficient communication and storage. Moreover, FL allows us to effectively learn meta-knowledge from a large set of action videos among heterogeneous clients. Within our framework, we establish a unified benchmark to systematically and fairly compare different components, including feature extraction, meta-learning, and FL for model update and aggregation. This type of benchmark is still lacking in the literature. Notably, we thoroughly examine six 3D CNN and Transformer models for extracting spatiotemporal video features needed to adapt to new tasks quickly during the meta-learning process. We further propose a hybrid feature extractor that combines the advantages of 3D CNNs and Transformers to produce strong video representations. Additionally, we explore three meta-learning paradigms and three FL algorithms to investigate their effectiveness and suggest the optimal choices for performance improvement. Results from extensive experiments on four action datasets verify the robustness of the FedFSLAR++ framework. Our comprehensive study provides a solid foundation for future research advancements in action recognition. | 
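The description pairs locally meta-trained few-shot learners with a federated update-and-aggregation step. As a hedged illustration of only the server-side aggregation, the sketch below performs a FedAvg-style weighted average of client parameters. FedAvg is used here as a generic stand-in: the paper evaluates three FL algorithms that this sketch does not claim to reproduce, and the parameter names, shapes, and client counts are illustrative assumptions.

```python
# Minimal sketch (assumption, not the paper's exact algorithm): FedAvg-style
# aggregation of client model weights, weighted by each client's number of
# local training videos.
import numpy as np

def fedavg(client_weights, client_sizes):
    """client_weights: list of dicts mapping parameter name -> np.ndarray,
    one dict per client, all with identical keys and shapes.
    client_sizes: number of local training videos per client."""
    total = float(sum(client_sizes))
    aggregated = {}
    for name in client_weights[0]:
        # Weighted average of this parameter across clients.
        aggregated[name] = sum(
            (n / total) * w[name] for w, n in zip(client_weights, client_sizes)
        )
    return aggregated

# Toy example: three clients sharing a tiny two-parameter model (illustrative values).
clients = [
    {"backbone.w": np.full((2, 2), float(i)), "head.b": np.array([float(i), float(i)])}
    for i in range(3)
]
print(fedavg(clients, client_sizes=[100, 50, 50]))
```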
    
| format | Article | 
    
| id | doaj-art-d49f2d27acf84cb8bc07a693e3319e9b | 
    
| institution | Kabale University | 
    
| issn | 2169-3536 | 
    
| language | English | 
    
| publishDate | 2024-01-01 | 
    
| publisher | IEEE | 
    
| record_format | Article | 
    
| series | IEEE Access | 
    
| spelling | doaj-art-d49f2d27acf84cb8bc07a693e3319e9b; 2024-12-27T00:01:01Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2024-01-01; vol. 12, pp. 193141-193164; doi:10.1109/ACCESS.2024.3519254; IEEE document 10804801; Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition; Authors: Nguyen Anh Tu (https://orcid.org/0000-0002-0650-8169), Nartay Aikyn (https://orcid.org/0009-0002-5747-0989), Nursultan Makhanov, Assanali Abu, Kok-Seng Wong (https://orcid.org/0000-0002-2029-7644), Min-Ho Lee; Affiliations: Department of Computer Science, School of Engineering and Digital Sciences, Nazarbayev University, Astana, Kazakhstan (Tu, Aikyn, Makhanov, Abu, Lee); College of Engineering and Computer Science, VinUniversity, Hanoi, Vietnam (Wong); https://ieeexplore.ieee.org/document/10804801/; Subjects: Human action recognition; few-shot learning; federated learning; representation learning; few-shot action recognition |
    
| spellingShingle | Nguyen Anh Tu; Nartay Aikyn; Nursultan Makhanov; Assanali Abu; Kok-Seng Wong; Min-Ho Lee; Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition; IEEE Access; Human action recognition; few-shot learning; federated learning; representation learning; few-shot action recognition |
    
| title | Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition | 
    
| title_full | Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition | 
    
| title_fullStr | Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition | 
    
| title_full_unstemmed | Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition | 
    
| title_short | Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition | 
    
| title_sort | benchmarking federated few shot learning for video based action recognition | 
    
| topic | Human action recognition few-shot learning federated learning representation learning few-shot action recognition  | 
    
| url | https://ieeexplore.ieee.org/document/10804801/ | 
    
| work_keys_str_mv | AT nguyenanhtu benchmarkingfederatedfewshotlearningforvideobasedactionrecognition AT nartayaikyn benchmarkingfederatedfewshotlearningforvideobasedactionrecognition AT nursultanmakhanov benchmarkingfederatedfewshotlearningforvideobasedactionrecognition AT assanaliabu benchmarkingfederatedfewshotlearningforvideobasedactionrecognition AT koksengwong benchmarkingfederatedfewshotlearningforvideobasedactionrecognition AT minholee benchmarkingfederatedfewshotlearningforvideobasedactionrecognition  |