Detecting cardiovascular diseases using unsupervised machine learning clustering based on electronic medical records
| Main Authors: | , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC 2024-12-01 |
| Series: | BMC Medical Research Methodology |
| Subjects: | |
| Online Access: | https://doi.org/10.1186/s12874-024-02422-z |
| Summary: | Abstract. Background: Machine learning (ML) models trained on electronic medical records (EMR) have potential in cardiovascular disease (CVD) risk prediction: by integrating a range of patient medical data, they can facilitate timely diagnosis and classification of CVDs. We tested the hypothesis that an unsupervised ML approach utilizing EMR could be used to develop a new model for detecting prevalent CVD in clinical settings. Methods: We included 155,894 patients (aged ≥ 18 years) discharged between January 2014 and July 2022 from Xuhui Hospital, Shanghai, China, including 64,916 CVD cases and 90,979 non-CVD cases. K-means clustering was used to generate the clustering models, with k = 2, 4, and 8 as the predetermined numbers of clusters. Bayes' theorem was used to estimate the models’ predictive accuracy. Results: The overall predictive accuracy of the 2-, 4-, and 8-cluster models in the training set was 0.856, 0.8634, and 0.8506, respectively. Similarly, the predictive accuracy of the 2-, 4-, and 8-cluster models in the testing set was 0.8598, 0.8659, and 0.8525, respectively. After principal component analysis reduced the data from 19 dimensions to 2, significant separation was observed between CVD cases and non-CVD cases in both the training and testing sets. Conclusion: Our findings indicate that EMR data can support the development of a robust model for CVD detection through an unsupervised ML approach. Further investigation using a longitudinal design is needed to refine the model for application in clinical settings. |
| ISSN: | 1471-2288 |
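
The summary describes clustering patients with K-means (k = 2, 4, 8) and then using Bayes' theorem to turn cluster membership into a CVD prediction. The sketch below is one plausible reading of that pipeline, not the authors' implementation. It uses only the Python standard library. The data are synthetic Gaussian blobs standing in for the 19 EMR features, and every function name (`make_synthetic`, `kmeans`, `cluster_posteriors`, `accuracy`) and the 0.5 decision threshold are assumptions made for illustration. The principal component analysis step is not covered.

```python
import random


def make_synthetic(n_per_class, dims, shift, seed):
    """Hypothetical stand-in for EMR features: two Gaussian blobs, label 1 = CVD."""
    rng = random.Random(seed)
    points, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            points.append([rng.gauss(label * shift, 1.0) for _ in range(dims)])
            labels.append(label)
    return points, labels


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; labels are never used here (unsupervised)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        new_assign = [nearest(p, centroids) for p in points]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def cluster_posteriors(assign, labels, k):
    """P(CVD | cluster) = P(cluster | CVD) * P(CVD) / P(cluster), from counts."""
    n, n_cvd = len(labels), sum(labels)
    prior = n_cvd / n
    posteriors = []
    for c in range(k):
        n_c = sum(1 for a in assign if a == c)
        n_c_cvd = sum(1 for a, y in zip(assign, labels) if a == c and y == 1)
        if n_c == 0 or n_cvd == 0:
            posteriors.append(0.0)
            continue
        likelihood = n_c_cvd / n_cvd  # P(cluster | CVD)
        evidence = n_c / n            # P(cluster)
        posteriors.append(likelihood * prior / evidence)
    return posteriors


def accuracy(assign, labels, posteriors):
    """Predict CVD when the cluster's posterior exceeds 0.5 (assumed threshold)."""
    predicted = [1 if posteriors[a] > 0.5 else 0 for a in assign]
    return sum(int(p == y) for p, y in zip(predicted, labels)) / len(labels)


train_x, train_y = make_synthetic(150, 19, 1.5, seed=1)
test_x, test_y = make_synthetic(75, 19, 1.5, seed=2)
for k in (2, 4, 8):
    centroids, train_assign = kmeans(train_x, k)
    post = cluster_posteriors(train_assign, train_y, k)
    test_assign = [nearest(p, centroids) for p in test_x]
    print(k, accuracy(train_assign, train_y, post), accuracy(test_assign, test_y, post))
```

The diagnosis labels enter only after clustering, when each cluster is scored by its posterior probability of CVD. Test-set patients are assigned to the nearest training centroid and inherit that cluster's posterior. The accuracies printed for the synthetic blobs say nothing about the figures reported in the summary.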