Improving Cardiovascular Disease Prediction by Integrating Imputation, Imbalance Resampling, and Feature Selection Techniques into Machine Learning Model

Fadlan Hamid Alfebi(1*), Mila Desi Anasanti(2)

(1) Program Studi S2 Ilmu Komputer, Universitas Nusa Mandiri, Jakarta
(2) Program Studi S2 Ilmu Komputer, Universitas Nusa Mandiri, Jakarta; Department of Information Studies, University College London, London; Bart and London Genome Center, Queen Mary University of London, London
(*) Corresponding Author


Cardiovascular disease (CVD) is the leading cause of death worldwide. Primary prevention relies on early prediction of disease onset. Using laboratory data from the 2017-2020 cycle of the National Health and Nutrition Examination Survey (NHANES) (N = 7,974), we tested the ability of machine learning (ML) algorithms to classify individuals at risk. The ML models were evaluated on their classification performance after comparing four imputation, three imbalance resampling, and three feature selection techniques.

Because of its popularity, we used a decision tree (DT) as the baseline. Integrating multiple imputation by chained equations (MICE) and the synthetic minority oversampling technique with Tomek link undersampling (SMOTETomek) into the model improved the area under the receiver operating characteristic curve (AUC-ROC) from 57% to 83%. Applying simultaneous perturbation feature selection and ranking (spFSR) reduced the number of predictors from 144 to 30 and the computational time by 22%. The best techniques were then applied to six ML models, among which eXtreme Gradient Boosting (XGBoost) achieved the highest accuracy (93%) and AUC-ROC (89%).
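To make the resampling step concrete, the following is a minimal pure-Python sketch of the idea behind SMOTETomek: oversample the minority class by interpolating between neighbouring minority samples, then remove Tomek links (pairs of opposite-class samples that are each other's nearest neighbour). It is an illustration only and not the authors' implementation; the function names (`smote`, `remove_tomek_links`, `smote_tomek`), the toy data, and the choice to drop both members of each link are assumptions made for this example.

```python
import math
import random


def nearest(idx, points):
    """Index of the point closest to points[idx], excluding idx itself."""
    return min((j for j in range(len(points)) if j != idx),
               key=lambda j: math.dist(points[idx], points[j]))


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each a random interpolation between
    a minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link (assumed variant): a pair of
    opposite-class points that are mutual nearest neighbours."""
    nn = [nearest(i, X) for i in range(len(X))]
    drop = {i for i in range(len(X)) if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_tomek(X, y, minority_label=1, k=3, seed=0):
    """Balance the classes with SMOTE, then clean the result of Tomek links."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = (len(X) - len(minority)) - len(minority)
    new_points = smote(minority, n_new, k=k, rng=random.Random(seed))
    X_bal = list(X) + new_points
    y_bal = list(y) + [minority_label] * len(new_points)
    return remove_tomek_links(X_bal, y_bal)


# Hypothetical toy data: 8 majority (label 0) and 3 minority (label 1) points.
X_demo = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (0.2, 0.9), (0.8, 0.8),
          (1.2, 0.6), (0.4, 0.5), (0.9, 0.3),
          (3.0, 3.0), (3.4, 3.1), (3.1, 3.5)]
y_demo = [0] * 8 + [1] * 3
X_res, y_res = smote_tomek(X_demo, y_demo)
```

In practice, resampling of this kind is normally applied to the training split only, so that synthetic samples do not leak into the evaluation data.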

The accuracy of our ML model in predicting CVD exceeds that reported in previous studies. We also highlight the important causes of CVD, which might be investigated further for potential effects on electronic health records.



Keywords: machine learning; cardiovascular disease; imputation; resampling; feature selection





Copyright (c) 2023 IJCCS (Indonesian Journal of Computing and Cybernetics Systems)

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

IJCCS (Indonesian Journal of Computing and Cybernetics Systems)
ISSN 1978-1520 (print); ISSN 2460-7258 (online)
A scientific journal of research results in computing and cybernetics systems, published by IndoCEISS.
Gedung S1 Ruang 416 FMIPA UGM, Sekip Utara, Yogyakarta 55281
Fax: +62274 555133
