Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection
Recent research has employed deep learning to detect intake gestures from inertial sensor and video camera data. However, the fusion of these modalities has not been attempted. The present research explores the potential of fusing the outputs of two individual deep learning inertial and video intake...
Main Authors: Hamid Heydarian, Marc T. P. Adam, Tracy L. Burrows, Megan E. Rollo
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9567689/
_version_ | 1841563310523154432 |
author | Hamid Heydarian Marc T. P. Adam Tracy L. Burrows Megan E. Rollo |
author_facet | Hamid Heydarian Marc T. P. Adam Tracy L. Burrows Megan E. Rollo |
author_sort | Hamid Heydarian |
collection | DOAJ |
description | Recent research has employed deep learning to detect intake gestures from inertial sensor and video camera data. However, the fusion of these modalities has not been attempted. The present research explores the potential of fusing the outputs of two individual deep learning inertial and video intake gesture detection models (i.e., score-level and decision-level fusion) using the test sets from two publicly available multimodal datasets: (1) OREBA-DIS recorded from 100 participants while consuming food served in discrete portions and (2) OREBA-SHA recorded from 102 participants while consuming a communal dish. We first assess the potential of fusion by contrasting the performance of the individual models in intake gesture detection. The assessment shows that fusing the outputs of individual models is more promising on the OREBA-DIS dataset. Subsequently, we conduct experiments using different score-level and decision-level fusion approaches. Our results from fusion show that the score-level fusion approach of max score model performs best of all considered fusion approaches. On the OREBA-DIS dataset, the max score fusion approach (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.871$ </tex-math></inline-formula>) outperforms both individual video (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.855$ </tex-math></inline-formula>) and inertial (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.806$ </tex-math></inline-formula>) models. However, on the OREBA-SHA dataset, the max score fusion approach (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.873$ </tex-math></inline-formula>) fails to outperform the individual inertial model (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.895$ </tex-math></inline-formula>). Pairwise comparisons using bootstrapped samples confirm the statistical significance of these differences in model performance (<inline-formula> <tex-math notation="LaTeX">$p \lt $ </tex-math></inline-formula>.001). |
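The description above contrasts score-level fusion (combining the two models' per-class scores before deciding, with the max score rule reported as the best-performing approach) against decision-level fusion (combining each model's hard decision). A minimal NumPy sketch of both families, assuming each model emits per-class probability scores per sample; the confidence-based tie-break in the decision-level variant is an illustrative assumption, not necessarily the rule used in the paper:

```python
import numpy as np

def score_level_max_fusion(inertial_scores, video_scores):
    """Score-level max fusion: per class, keep the higher of the two
    models' scores, then predict the class with the highest fused score."""
    fused = np.maximum(inertial_scores, video_scores)
    return fused.argmax(axis=-1)

def decision_level_fusion(inertial_scores, video_scores):
    """Decision-level fusion: each model votes with its own argmax;
    on disagreement, defer to the more confident model (an assumed
    tie-break for illustration)."""
    i_pred = inertial_scores.argmax(axis=-1)
    v_pred = video_scores.argmax(axis=-1)
    i_conf = inertial_scores.max(axis=-1)
    v_conf = video_scores.max(axis=-1)
    return np.where(i_pred == v_pred, i_pred,
                    np.where(i_conf >= v_conf, i_pred, v_pred))

# Two samples, two classes (e.g. intake vs. non-intake):
inertial = np.array([[0.7, 0.3], [0.6, 0.4]])
video = np.array([[0.2, 0.8], [0.55, 0.45]])
print(score_level_max_fusion(inertial, video))  # fused scores decide
print(decision_level_fusion(inertial, video))   # hard votes decide
```

Note that both operate only on model outputs, which is what makes the comparison in the paper possible on pre-trained individual models without joint retraining.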
format | Article |
id | doaj-art-9ca7b659c73e4fc9894fb51d4e2f24d3 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-9ca7b659c73e4fc9894fb51d4e2f24d3 2025-01-03T00:01:59Z eng IEEE IEEE Access 2169-3536 2025-01-01 13643655 10.1109/ACCESS.2021.3119253 9567689 Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection. Authors: Hamid Heydarian (https://orcid.org/0000-0002-9824-5828), School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia; Marc T. P. Adam (https://orcid.org/0000-0002-6036-4282), School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia; Tracy L. Burrows, Priority Research Centre for Physical Activity and Nutrition, The University of Newcastle, Callaghan, NSW, Australia; Megan E. Rollo, Priority Research Centre for Physical Activity and Nutrition, The University of Newcastle, Callaghan, NSW, Australia. https://ieeexplore.ieee.org/document/9567689/ Keywords: Score-level fusion; decision-level fusion; intake gesture detection; deep learning; inertial; accelerometer |
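The record also states that pairwise comparisons using bootstrapped samples confirmed the significance of the F1 differences (p < .001). The paper's exact resampling scheme is not given here, so the following is a generic paired-bootstrap sketch, assuming per-item (e.g. per-recording) F1 scores are available for both models being compared:

```python
import numpy as np

def paired_bootstrap_pvalue(f1_a, f1_b, n_boot=10_000, seed=42):
    """Paired bootstrap over per-item F1 scores: resample items with
    replacement and estimate how often model A's mean F1 fails to
    exceed model B's, giving an approximate one-sided p-value."""
    rng = np.random.default_rng(seed)
    f1_a = np.asarray(f1_a, dtype=float)
    f1_b = np.asarray(f1_b, dtype=float)
    n = f1_a.shape[0]
    idx = rng.integers(0, n, size=(n_boot, n))  # resampled item indices
    diffs = f1_a[idx].mean(axis=1) - f1_b[idx].mean(axis=1)
    return float((diffs <= 0).mean())
```

Because the same resampled indices are applied to both score vectors, the comparison stays paired, which matters when per-item difficulty varies across participants.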
spellingShingle | Hamid Heydarian Marc T. P. Adam Tracy L. Burrows Megan E. Rollo Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection IEEE Access Score-level fusion decision-level fusion intake gesture detection deep learning inertial accelerometer |
title | Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection |
title_full | Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection |
title_fullStr | Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection |
title_full_unstemmed | Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection |
title_short | Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection |
title_sort | exploring score level and decision level fusion of inertial and video data for intake gesture detection |
topic | Score-level fusion decision-level fusion intake gesture detection deep learning inertial accelerometer |
url | https://ieeexplore.ieee.org/document/9567689/ |
work_keys_str_mv | AT hamidheydarian exploringscorelevelanddecisionlevelfusionofinertialandvideodataforintakegesturedetection AT marctpadam exploringscorelevelanddecisionlevelfusionofinertialandvideodataforintakegesturedetection AT tracylburrows exploringscorelevelanddecisionlevelfusionofinertialandvideodataforintakegesturedetection AT meganerollo exploringscorelevelanddecisionlevelfusionofinertialandvideodataforintakegesturedetection |