Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition
Few-shot action recognition aims to train a model to classify actions in videos using only a few examples, known as “shots,” per action class. This learning approach is particularly useful but challenging due to the limited availability of labeled video data in practice.
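As a rough, hedged illustration of the few-shot classification setting the abstract describes, the sketch below classifies query video clips in an N-way K-shot episode by nearest class prototype over precomputed clip embeddings. A prototypical-network-style classifier is one common meta-learning choice, used here only as an assumption for illustration; the paper does not name its paradigms at this point, and the embedding dimension, class count, and function names are invented for the example.

```python
# Minimal sketch (assumption, not the paper's method): nearest-prototype
# classification of query clips in an N-way K-shot episode, operating on
# precomputed video embeddings.
import numpy as np

def prototype_classify(support, support_labels, query, n_way):
    """support: (n_way * k_shot, d) embeddings of labeled support clips.
    support_labels: (n_way * k_shot,) integer class ids in [0, n_way).
    query: (n_query, d) embeddings of clips to classify.
    Returns predicted class ids for the query clips."""
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in range(n_way)]
    )
    # Assign each query clip to the nearest prototype (squared Euclidean distance).
    dists = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy 5-way 3-shot episode with random 256-d "video features" (illustrative values).
rng = np.random.default_rng(0)
d, n_way, k_shot, n_query = 256, 5, 3, 10
support = rng.normal(size=(n_way * k_shot, d))
labels = np.repeat(np.arange(n_way), k_shot)
query = rng.normal(size=(n_query, d))
print(prototype_classify(support, labels, query, n_way))
```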
| Main Authors: | Nguyen Anh Tu, Nartay Aikyn, Nursultan Makhanov, Assanali Abu, Kok-Seng Wong, Min-Ho Lee |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | Human action recognition; few-shot learning; federated learning; representation learning; few-shot action recognition |
| Online Access: | https://ieeexplore.ieee.org/document/10804801/ |
| _version_ | 1846107094869082112 |
|---|---|
| author | Nguyen Anh Tu, Nartay Aikyn, Nursultan Makhanov, Assanali Abu, Kok-Seng Wong, Min-Ho Lee |
    
| author_facet | Nguyen Anh Tu, Nartay Aikyn, Nursultan Makhanov, Assanali Abu, Kok-Seng Wong, Min-Ho Lee |
    
| author_sort | Nguyen Anh Tu | 
    
| collection | DOAJ | 
    
| description | Few-shot action recognition aims to train a model to classify actions in videos using only a few examples, known as “shots,” per action class. This learning approach is particularly useful but challenging due to the limited availability of labeled video data in practice. Although significant progress has been made in developing few-shot learners, existing methods still face several limitations. Firstly, current methods have not sufficiently explored the effectiveness of 3D feature extractors (e.g., 3D CNNs or Video Transformers), thereby failing to exploit spatiotemporal dynamics in videos. Secondly, the need for a large video dataset to train the model in a centralized manner raises privacy concerns and results in high storage costs and communication overheads. Thirdly, the existing solutions based on local deployment lack the capability to benefit global prior knowledge from a wide variety of real-world action samples. To address these limitations, we propose a federated learning (FL) framework named FedFSLAR++ to collaboratively train few-shot learners with 3D feature extractors. Specifically, we perform few-shot action recognition tasks under FL settings, enhancing privacy protection while maintaining efficient communication and storage. Moreover, FL allows us to effectively learn meta-knowledge from a large set of action videos among heterogeneous clients. Within our framework, we establish a unified benchmark to systematically and fairly compare different components, including feature extraction, meta-learning, and FL for model update and aggregation. This type of benchmark is still lacking in the literature. Notably, we thoroughly examine six 3D CNN and Transformer models for extracting spatiotemporal video features needed to adapt to new tasks quickly during the meta-learning process. We further propose a hybrid feature extractor that combines the advantages of 3D CNNs and Transformers to produce strong video representations. Additionally, we explore three meta-learning paradigms and three FL algorithms to investigate their effectiveness and suggest the optimal choices for performance improvement. Results from extensive experiments on four action datasets verify the robustness of the FedFSLAR++ framework. Our comprehensive study provides a solid foundation for future research advancements in action recognition. | 
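The description pairs locally meta-trained few-shot learners with a federated update-and-aggregation step. As a hedged illustration of only the server-side aggregation, the sketch below performs a FedAvg-style weighted average of client parameters. FedAvg is used here as a generic stand-in: the paper evaluates three FL algorithms that this sketch does not claim to reproduce, and the parameter names, shapes, and client counts are illustrative assumptions.

```python
# Minimal sketch (assumption, not the paper's exact algorithm): FedAvg-style
# aggregation of client model weights, weighted by each client's number of
# local training videos.
import numpy as np

def fedavg(client_weights, client_sizes):
    """client_weights: list of dicts mapping parameter name -> np.ndarray,
    one dict per client, all with identical keys and shapes.
    client_sizes: number of local training videos per client."""
    total = float(sum(client_sizes))
    aggregated = {}
    for name in client_weights[0]:
        # Weighted average of this parameter across clients.
        aggregated[name] = sum(
            (n / total) * w[name] for w, n in zip(client_weights, client_sizes)
        )
    return aggregated

# Toy example: three clients sharing a tiny two-parameter model (illustrative values).
clients = [
    {"backbone.w": np.full((2, 2), float(i)), "head.b": np.array([float(i), float(i)])}
    for i in range(3)
]
print(fedavg(clients, client_sizes=[100, 50, 50]))
```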
    
| format | Article | 
    
| id | doaj-art-d49f2d27acf84cb8bc07a693e3319e9b | 
    
| institution | Kabale University | 
    
| issn | 2169-3536 | 
    
| language | English | 
    
| publishDate | 2024-01-01 | 
    
| publisher | IEEE | 
    
| record_format | Article | 
    
| series | IEEE Access | 
    
| spelling | doaj-art-d49f2d27acf84cb8bc07a693e3319e9b; 2024-12-27T00:01:01Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2024-01-01; vol. 12, pp. 193141-193164; doi:10.1109/ACCESS.2024.3519254; IEEE document 10804801; Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition; Authors: Nguyen Anh Tu (https://orcid.org/0000-0002-0650-8169), Nartay Aikyn (https://orcid.org/0009-0002-5747-0989), Nursultan Makhanov, Assanali Abu, Kok-Seng Wong (https://orcid.org/0000-0002-2029-7644), Min-Ho Lee; Affiliations: Department of Computer Science, School of Engineering and Digital Sciences, Nazarbayev University, Astana, Kazakhstan (Tu, Aikyn, Makhanov, Abu, Lee); College of Engineering and Computer Science, VinUniversity, Hanoi, Vietnam (Wong); https://ieeexplore.ieee.org/document/10804801/; Subjects: Human action recognition; few-shot learning; federated learning; representation learning; few-shot action recognition |
    
| spellingShingle | Nguyen Anh Tu; Nartay Aikyn; Nursultan Makhanov; Assanali Abu; Kok-Seng Wong; Min-Ho Lee; Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition; IEEE Access; Human action recognition; few-shot learning; federated learning; representation learning; few-shot action recognition |
    
| title | Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition | 
    
| title_full | Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition | 
    
| title_fullStr | Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition | 
    
| title_full_unstemmed | Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition | 
    
| title_short | Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition | 
    
| title_sort | benchmarking federated few shot learning for video based action recognition | 
    
| topic | Human action recognition few-shot learning federated learning representation learning few-shot action recognition  | 
    
| url | https://ieeexplore.ieee.org/document/10804801/ | 
    
| work_keys_str_mv | AT nguyenanhtu benchmarkingfederatedfewshotlearningforvideobasedactionrecognition AT nartayaikyn benchmarkingfederatedfewshotlearningforvideobasedactionrecognition AT nursultanmakhanov benchmarkingfederatedfewshotlearningforvideobasedactionrecognition AT assanaliabu benchmarkingfederatedfewshotlearningforvideobasedactionrecognition AT koksengwong benchmarkingfederatedfewshotlearningforvideobasedactionrecognition AT minholee benchmarkingfederatedfewshotlearningforvideobasedactionrecognition  |