Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction

Background: Research comparing artificial intelligence and machine learning (AI/ML) methods with classical statistical methods applied to large population health databases is limited. Objectives: This retrospective cohort study aimed to compare the predictive performance of AI/ML algorithms against...

Bibliographic Details
Main Authors: Lise M. Bjerre, Cayden Peixoto, Rawan Alkurd, Robert Talarico, Rami Abielmona
Format: Article
Language: English
Published: Elsevier 2024-12-01
Series: Global Epidemiology
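Subjects: Artificial intelligence; Machine learning; COVID-19; Logistic regression; Predictive modeling; Gradient boosting trees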
Online Access: http://www.sciencedirect.com/science/article/pii/S2590113324000348
_version_ 1846127246713028608
author Lise M. Bjerre
Cayden Peixoto
Rawan Alkurd
Robert Talarico
Rami Abielmona
author_facet Lise M. Bjerre
Cayden Peixoto
Rawan Alkurd
Robert Talarico
Rami Abielmona
author_sort Lise M. Bjerre
collection DOAJ
description Background: Research comparing artificial intelligence and machine learning (AI/ML) methods with classical statistical methods applied to large population health databases is limited. Objectives: This retrospective cohort study aimed to compare the predictive performance of AI/ML algorithms against conventional multivariate logistic regression models using linked health administrative data. Methods: Using Ontario's population health databases, we created a cohort of residents of the city of Ottawa, Ontario, who underwent a PCR test for COVID-19 between March 10, 2020, and May 13, 2021. Using demographic, socio-economic and health data (including COVID-19 PCR test results and available symptom data), we developed predictive models for COVID-19 case identification using the following approaches: classical multivariate logistic regression (LR); deep neural network (DNN); random forest (RF); and gradient boosting trees (GBT). Model performance was compared using swarm plots of the area under the curve (AUC) from 10-fold cross-validation. Results: The cohort consisted of n = 351,248 Ottawa residents tested for COVID-19 during the study period, among whom a total of n = 883,879 unique COVID-19 tests were performed (2.6% positive test results). Inclusion of COVID-19 symptoms data in the analysis improved model performance and variable predictive value across all tested models (p < 0.0001), with the 10-fold cross-validation AUC increasing to near or over 0.7 in all models when symptoms data were included. In pairwise comparisons, the GBT method had the highest predictive ability (AUC = 0.796 ± 0.017), significantly outperforming multivariate logistic regression and the other AI/ML approaches. Conclusions: Conventional multivariate regression-based models perform better than some machine learning algorithms and worse than others, and provide good predictive accuracy in a moderately sized dataset with a reasonable number of features. However, whenever possible, the AI/ML GBT approach should be considered.
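The evaluation protocol described in the abstract (per-fold AUC from 10-fold cross-validation, summarized as mean ± standard deviation) can be illustrated with a minimal, standard-library-only sketch. Everything below is an assumption made for illustration: the data are synthetic, the "symptom" column is hypothetical, and the centroid-based scorer is a toy stand-in, not the LR, DNN, RF, or GBT models fitted in the study.

```python
import random
import statistics


def roc_auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U); tied scores share their average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def fit_centroid_scorer(X, y):
    """Toy stand-in model: project onto (mean of positives - mean of negatives)."""
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    w = [statistics.fmean(p[j] for p in pos) - statistics.fmean(q[j] for q in neg)
         for j in range(len(X[0]))]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def cross_val_auc(fit, X, y, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, return one AUC per fold."""
    aucs = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        score = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        aucs.append(roc_auc([y[i] for i in test_idx],
                            [score(X[i]) for i in test_idx]))
    return aucs


# Synthetic data (assumption): column 0 is pure noise, column 1 is a
# hypothetical "symptom" signal whose mean is shifted for positive cases.
rng = random.Random(42)
y = [1 if rng.random() < 0.1 else 0 for _ in range(2000)]
X = [[rng.gauss(0, 1), rng.gauss(1.0 if lab else 0.0, 1)] for lab in y]
X_no_symptom = [[row[0]] for row in X]

aucs_full = cross_val_auc(fit_centroid_scorer, X, y)
aucs_base = cross_val_auc(fit_centroid_scorer, X_no_symptom, y)
for name, aucs in (("with symptom column", aucs_full),
                   ("without symptom column", aucs_base)):
    print(f"{name}: AUC = {statistics.fmean(aucs):.3f} ± {statistics.stdev(aucs):.3f}")
```

The per-fold lists (`aucs_full`, `aucs_base`) are the kind of values a swarm plot would display, one point per fold. Swapping `fit_centroid_scorer` for another function with the same interface would allow different model families to be compared on identical folds.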
format Article
id doaj-art-90cd181a34b448cf91b0d9d86b49ac8e
institution Kabale University
issn 2590-1133
language English
publishDate 2024-12-01
publisher Elsevier
record_format Article
series Global Epidemiology
spelling doaj-art-90cd181a34b448cf91b0d9d86b49ac8e
2024-12-12T05:22:31Z
eng
Elsevier
Global Epidemiology
2590-1133
2024-12-01
8
100168
Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction
Lise M. Bjerre [0]
Cayden Peixoto [1]
Rawan Alkurd [2]
Robert Talarico [3]
Rami Abielmona [4]
[0] Institut du Savoir Montfort, 713, chemin Montréal, Ottawa, Ontario K1K 0T2, Canada; University of Ottawa, Faculty of Medicine, Department of Family Medicine, 201-600 Peter-Morand Crescent, Ottawa ON, K1G 5Z3, Canada; Institute for Clinical and Evaluative Sciences (ICES), 1053 Carling Avenue, Box 684, Administrative Services Building, 1st Floor, Ottawa, Ontario K1Y 4E9, Canada; Corresponding author at: 713, chemin Montréal, Ottawa, Ontario K1K 0T2, Canada.
[1] Institut du Savoir Montfort, 713, chemin Montréal, Ottawa, Ontario K1K 0T2, Canada
[2] Larus Technologies Corporation, 170 Laurier Ave West, Suite 310 Ottawa, Ontario K1P 5V5, Canada
[3] Institute for Clinical and Evaluative Sciences (ICES), 1053 Carling Avenue, Box 684, Administrative Services Building, 1st Floor, Ottawa, Ontario K1Y 4E9, Canada; Ottawa Hospital Research Institute, 501 Smyth Box 511, Ottawa ON, K1H 8L6, Canada
[4] Larus Technologies Corporation, 170 Laurier Ave West, Suite 310 Ottawa, Ontario K1P 5V5, Canada; University of Ottawa, Faculty of Engineering, 800 King Edward Ave, Ottawa, ON K1N 6N5, Canada
Background: Research comparing artificial intelligence and machine learning (AI/ML) methods with classical statistical methods applied to large population health databases is limited. Objectives: This retrospective cohort study aimed to compare the predictive performance of AI/ML algorithms against conventional multivariate logistic regression models using linked health administrative data. Methods: Using Ontario's population health databases, we created a cohort of residents of the city of Ottawa, Ontario, who underwent a PCR test for COVID-19 between March 10, 2020, and May 13, 2021. Using demographic, socio-economic and health data (including COVID-19 PCR test results and available symptom data), we developed predictive models for COVID-19 case identification using the following approaches: classical multivariate logistic regression (LR); deep neural network (DNN); random forest (RF); and gradient boosting trees (GBT). Model performance was compared using swarm plots of the area under the curve (AUC) from 10-fold cross-validation. Results: The cohort consisted of n = 351,248 Ottawa residents tested for COVID-19 during the study period, among whom a total of n = 883,879 unique COVID-19 tests were performed (2.6% positive test results). Inclusion of COVID-19 symptoms data in the analysis improved model performance and variable predictive value across all tested models (p < 0.0001), with the 10-fold cross-validation AUC increasing to near or over 0.7 in all models when symptoms data were included. In pairwise comparisons, the GBT method had the highest predictive ability (AUC = 0.796 ± 0.017), significantly outperforming multivariate logistic regression and the other AI/ML approaches. Conclusions: Conventional multivariate regression-based models perform better than some machine learning algorithms and worse than others, and provide good predictive accuracy in a moderately sized dataset with a reasonable number of features. However, whenever possible, the AI/ML GBT approach should be considered.
http://www.sciencedirect.com/science/article/pii/S2590113324000348
Artificial intelligence
Machine learning
COVID-19
Logistic regression
Predictive modeling
Gradient boosting trees
spellingShingle Lise M. Bjerre
Cayden Peixoto
Rawan Alkurd
Robert Talarico
Rami Abielmona
Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction
Global Epidemiology
Artificial intelligence
Machine learning
COVID-19
Logistic regression
Predictive modeling
Gradient boosting trees
title Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction
title_full Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction
title_fullStr Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction
title_full_unstemmed Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction
title_short Comparing AI/ML approaches and classical regression for predictive modeling using large population health databases: Applications to COVID-19 case prediction
title_sort comparing ai ml approaches and classical regression for predictive modeling using large population health databases applications to covid 19 case prediction
topic Artificial intelligence
Machine learning
COVID-19
Logistic regression
Predictive modeling
Gradient boosting trees
url http://www.sciencedirect.com/science/article/pii/S2590113324000348
work_keys_str_mv AT lisembjerre comparingaimlapproachesandclassicalregressionforpredictivemodelingusinglargepopulationhealthdatabasesapplicationstocovid19caseprediction
AT caydenpeixoto comparingaimlapproachesandclassicalregressionforpredictivemodelingusinglargepopulationhealthdatabasesapplicationstocovid19caseprediction
AT rawanalkurd comparingaimlapproachesandclassicalregressionforpredictivemodelingusinglargepopulationhealthdatabasesapplicationstocovid19caseprediction
AT roberttalarico comparingaimlapproachesandclassicalregressionforpredictivemodelingusinglargepopulationhealthdatabasesapplicationstocovid19caseprediction
AT ramiabielmona comparingaimlapproachesandclassicalregressionforpredictivemodelingusinglargepopulationhealthdatabasesapplicationstocovid19caseprediction