Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection

Recent research has employed deep learning to detect intake gestures from inertial sensor and video camera data. However, the fusion of these modalities has not been attempted. The present research explores the potential of fusing the outputs of two individual deep learning inertial and video intake...

Bibliographic Details
Main Authors: Hamid Heydarian, Marc T. P. Adam, Tracy L. Burrows, Megan E. Rollo
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/9567689/
_version_ 1841563310523154432
author Hamid Heydarian
Marc T. P. Adam
Tracy L. Burrows
Megan E. Rollo
author_facet Hamid Heydarian
Marc T. P. Adam
Tracy L. Burrows
Megan E. Rollo
author_sort Hamid Heydarian
collection DOAJ
description Recent research has employed deep learning to detect intake gestures from inertial sensor and video camera data. However, the fusion of these modalities has not been attempted. The present research explores the potential of fusing the outputs of two individual deep learning inertial and video intake gesture detection models (i.e., score-level and decision-level fusion) using the test sets from two publicly available multimodal datasets: (1) OREBA-DIS recorded from 100 participants while consuming food served in discrete portions and (2) OREBA-SHA recorded from 102 participants while consuming a communal dish. We first assess the potential of fusion by contrasting the performance of the individual models in intake gesture detection. The assessment shows that fusing the outputs of individual models is more promising on the OREBA-DIS dataset. Subsequently, we conduct experiments using different score-level and decision-level fusion approaches. Our results from fusion show that the score-level fusion approach of max score model performs best of all considered fusion approaches. On the OREBA-DIS dataset, the max score fusion approach (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.871$ </tex-math></inline-formula>) outperforms both individual video (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.855$ </tex-math></inline-formula>) and inertial (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.806$ </tex-math></inline-formula>) models. However, on the OREBA-SHA dataset, the max score fusion approach (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.873$ </tex-math></inline-formula>) fails to outperform the individual inertial model (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.895$ </tex-math></inline-formula>). Pairwise comparisons using bootstrapped samples confirm the statistical significance of these differences in model performance (<inline-formula> <tex-math notation="LaTeX">$p \lt $ </tex-math></inline-formula>.001).
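The max score fusion described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the per-frame probabilities, the 0.5 threshold, and the variable names are all assumptions made for demonstration.

```python
# Hypothetical per-frame intake-gesture probabilities from two
# independent detection models (values are illustrative only).
inertial_scores = [0.2, 0.7, 0.9, 0.1]
video_scores = [0.6, 0.4, 0.8, 0.3]

# Score-level "max score" fusion: per frame, keep the higher of the
# two model scores, then threshold to obtain a detection decision.
fused = [max(i, v) for i, v in zip(inertial_scores, video_scores)]
decisions = [score >= 0.5 for score in fused]

print(fused)      # [0.6, 0.7, 0.9, 0.3]
print(decisions)  # [True, True, True, False]
```

Decision-level fusion, by contrast, would combine the two models' thresholded outputs (e.g., by logical OR/AND) rather than their raw scores.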
format Article
id doaj-art-9ca7b659c73e4fc9894fb51d4e2f24d3
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-9ca7b659c73e4fc9894fb51d4e2f24d32025-01-03T00:01:59ZengIEEEIEEE Access2169-35362025-01-011364365510.1109/ACCESS.2021.31192539567689Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture DetectionHamid Heydarian0https://orcid.org/0000-0002-9824-5828Marc T. P. Adam1https://orcid.org/0000-0002-6036-4282Tracy L. Burrows2Megan E. Rollo3School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, AustraliaSchool of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, AustraliaPriority Research Centre for Physical Activity and Nutrition, The University of Newcastle, Callaghan, NSW, AustraliaPriority Research Centre for Physical Activity and Nutrition, The University of Newcastle, Callaghan, NSW, AustraliaRecent research has employed deep learning to detect intake gestures from inertial sensor and video camera data. However, the fusion of these modalities has not been attempted. The present research explores the potential of fusing the outputs of two individual deep learning inertial and video intake gesture detection models (i.e., score-level and decision-level fusion) using the test sets from two publicly available multimodal datasets: (1) OREBA-DIS recorded from 100 participants while consuming food served in discrete portions and (2) OREBA-SHA recorded from 102 participants while consuming a communal dish. We first assess the potential of fusion by contrasting the performance of the individual models in intake gesture detection. The assessment shows that fusing the outputs of individual models is more promising on the OREBA-DIS dataset. Subsequently, we conduct experiments using different score-level and decision-level fusion approaches. Our results from fusion show that the score-level fusion approach of max score model performs best of all considered fusion approaches. 
On the OREBA-DIS dataset, the max score fusion approach (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.871$ </tex-math></inline-formula>) outperforms both individual video (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.855$ </tex-math></inline-formula>) and inertial (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.806$ </tex-math></inline-formula>) models. However, on the OREBA-SHA dataset, the max score fusion approach (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.873$ </tex-math></inline-formula>) fails to outperform the individual inertial model (<inline-formula> <tex-math notation="LaTeX">$F_{1} =0.895$ </tex-math></inline-formula>). Pairwise comparisons using bootstrapped samples confirm the statistical significance of these differences in model performance (<inline-formula> <tex-math notation="LaTeX">$p \lt $ </tex-math></inline-formula>.001).https://ieeexplore.ieee.org/document/9567689/Score-level fusiondecision-level fusionintake gesture detectiondeep learninginertialaccelerometer
spellingShingle Hamid Heydarian
Marc T. P. Adam
Tracy L. Burrows
Megan E. Rollo
Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection
IEEE Access
Score-level fusion
decision-level fusion
intake gesture detection
deep learning
inertial
accelerometer
title Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection
title_full Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection
title_fullStr Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection
title_full_unstemmed Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection
title_short Exploring Score-Level and Decision-Level Fusion of Inertial and Video Data for Intake Gesture Detection
title_sort exploring score level and decision level fusion of inertial and video data for intake gesture detection
topic Score-level fusion
decision-level fusion
intake gesture detection
deep learning
inertial
accelerometer
url https://ieeexplore.ieee.org/document/9567689/
work_keys_str_mv AT hamidheydarian exploringscorelevelanddecisionlevelfusionofinertialandvideodataforintakegesturedetection
AT marctpadam exploringscorelevelanddecisionlevelfusionofinertialandvideodataforintakegesturedetection
AT tracylburrows exploringscorelevelanddecisionlevelfusionofinertialandvideodataforintakegesturedetection
AT meganerollo exploringscorelevelanddecisionlevelfusionofinertialandvideodataforintakegesturedetection