Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition

Few-shot action recognition aims to train a model to classify actions in videos using only a few examples, known as “shots,” per action class. This learning approach is particularly useful but challenging due to the limited availability of labeled video data in practice. Althou...

Bibliographic Details
Main Authors: Nguyen Anh Tu, Nartay Aikyn, Nursultan Makhanov, Assanali Abu, Kok-Seng Wong, Min-Ho Lee
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
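Subjects: Human action recognition; few-shot learning; federated learning; representation learning; few-shot action recognition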
Online Access: https://ieeexplore.ieee.org/document/10804801/
author Nguyen Anh Tu
Nartay Aikyn
Nursultan Makhanov
Assanali Abu
Kok-Seng Wong
Min-Ho Lee
collection DOAJ
description Few-shot action recognition aims to train a model to classify actions in videos using only a few examples, known as “shots,” per action class. This learning approach is particularly useful but challenging due to the limited availability of labeled video data in practice. Although significant progress has been made in developing few-shot learners, existing methods still face several limitations. Firstly, current methods have not sufficiently explored the effectiveness of 3D feature extractors (e.g., 3D CNNs or Video Transformers), thereby failing to exploit spatiotemporal dynamics in videos. Secondly, the need for a large video dataset to train the model in a centralized manner raises privacy concerns and results in high storage costs and communication overheads. Thirdly, the existing solutions based on local deployment cannot benefit from global prior knowledge drawn from a wide variety of real-world action samples. To address these limitations, we propose a federated learning (FL) framework named FedFSLAR++ to collaboratively train few-shot learners with 3D feature extractors. Specifically, we perform few-shot action recognition tasks under FL settings, enhancing privacy protection while maintaining efficient communication and storage. Moreover, FL allows us to effectively learn meta-knowledge from a large set of action videos among heterogeneous clients. Within our framework, we establish a unified benchmark to systematically and fairly compare different components, including feature extraction, meta-learning, and FL for model update and aggregation. This type of benchmark is still lacking in the literature. Notably, we thoroughly examine six 3D CNN and Transformer models for extracting the spatiotemporal video features needed to adapt quickly to new tasks during the meta-learning process. We further propose a hybrid feature extractor that combines the advantages of 3D CNNs and Transformers to produce strong video representations. Additionally, we explore three meta-learning paradigms and three FL algorithms to investigate their effectiveness and suggest the optimal choices for performance improvement. Results from extensive experiments on four action datasets verify the robustness of the FedFSLAR++ framework. Our comprehensive study provides a solid foundation for future research advancements in action recognition.
format Article
id doaj-art-d49f2d27acf84cb8bc07a693e3319e9b
institution Kabale University
issn 2169-3536
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-d49f2d27acf84cb8bc07a693e3319e9b
2024-12-27T00:01:01Z
eng
IEEE
IEEE Access
2169-3536
2024-01-01
volume 12
pages 193141-193164
doi 10.1109/ACCESS.2024.3519254
document 10804801
Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition
Nguyen Anh Tu (0), https://orcid.org/0000-0002-0650-8169, Department of Computer Science, School of Engineering and Digital Sciences, Nazarbayev University, Astana, Kazakhstan
Nartay Aikyn (1), https://orcid.org/0009-0002-5747-0989, Department of Computer Science, School of Engineering and Digital Sciences, Nazarbayev University, Astana, Kazakhstan
Nursultan Makhanov (2), Department of Computer Science, School of Engineering and Digital Sciences, Nazarbayev University, Astana, Kazakhstan
Assanali Abu (3), Department of Computer Science, School of Engineering and Digital Sciences, Nazarbayev University, Astana, Kazakhstan
Kok-Seng Wong (4), https://orcid.org/0000-0002-2029-7644, College of Engineering and Computer Science, VinUniversity, Hanoi, Vietnam
Min-Ho Lee (5), Department of Computer Science, School of Engineering and Digital Sciences, Nazarbayev University, Astana, Kazakhstan
title Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition
topic Human action recognition
few-shot learning
federated learning
representation learning
few-shot action recognition
url https://ieeexplore.ieee.org/document/10804801/