Optimizing Surgical Site Infection Prediction Performance Through a Combined Approach of SMOTE-ENN, SMOTE-Tomek, and Wrapper Feature Selection
DOI: https://doi.org/10.26713/cma.v16i3.3603
Keywords: Imbalanced classification, Combined sampling, Surgical site infection, Machine learning, Wrapper feature selection
Abstract
Risk stratification can be enhanced by assessing preoperative risk factors, which are essential for guiding surgical decision-making. Machine learning (ML) and AI-based expert systems can predict, detect, and monitor surgical site infections (SSIs) using data from electronic health records (EHRs). The predictive capability of classification algorithms is impaired by class imbalance and depends heavily on the quality of the features in a dataset, which may include irrelevant or redundant information. The primary goal of feature selection is to remove such features to improve classification accuracy. This study used a dataset of 64,793 surgical records, each with 25 variables, to evaluate six machine learning classification methods: logistic regression (LR), K-Nearest Neighbors (KNN), decision tree classifier (DTC), support vector machine (SVM), Gaussian Naive Bayes (GNB), and artificial neural network (ANN). Because these classifiers are designed to maximize overall classification accuracy, they frequently favor the majority class in an imbalanced dataset, which distorts the accuracy metric. To address the imbalanced classification problem in SSI prediction, the study applied the hybrid sampling techniques SMOTE-ENN and SMOTE-Tomek, combined with wrapper feature selection, to improve model performance. Wrapper feature selection optimizes the feature set by reducing the number of features while simultaneously enhancing classification accuracy. The findings showed that combining the Synthetic Minority Oversampling Technique with Edited Nearest Neighbors (SMOTE-ENN) and wrapper feature selection outperformed SMOTE-Tomek on all performance metrics, particularly for the minority class (KNN: AUC 98, recall 61, precision 80, F1-score 69, accuracy 98; DTC: AUC 93, recall 60, precision 76, F1-score 67, accuracy 98). SSI prediction accuracy and AUC improved significantly when the classifiers were paired with SMOTE-ENN sampling and optimized through wrapper feature selection. The proposed method effectively mitigates overfitting and underfitting, although the wrapper method is computationally intensive and therefore lengthens training time.
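The hybrid sampling idea named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the function names (`k_nearest`, `smote`, `edited_nearest_neighbours`, `smote_enn`), the toy two-dimensional data, and the choice of k = 3 are all assumptions made for illustration, and wrapper feature selection is not shown. In practice a library such as imbalanced-learn would normally be used instead.

```python
import math
import random


def k_nearest(point, pool, k):
    """Return the indices of the k points in `pool` closest to `point` (Euclidean)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbour = others[rng.choice(k_nearest(base, others, k))]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority of their k nearest neighbours."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        others_X = X[:i] + X[i + 1:]
        others_y = y[:i] + y[i + 1:]
        votes = [others_y[j] for j in k_nearest(xi, others_X, k)]
        if votes.count(yi) * 2 > len(votes):
            keep_X.append(xi)
            keep_y.append(yi)
    return keep_X, keep_y


def smote_enn(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity with SMOTE, then clean both classes with ENN."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k, seed)
    X_bal = X + new_points
    y_bal = y + [minority_label] * len(new_points)
    return edited_nearest_neighbours(X_bal, y_bal, k)


# Hypothetical toy data: a 4 x 3 grid of majority points (label 0), four
# minority points (label 1), and one noisy majority point inside the
# minority cluster.
majority = [(0.5 * a, 0.5 * b) for a in range(4) for b in range(3)]
noisy_majority = (3.1, 3.3)
minority = [(3.0, 3.0), (3.5, 3.2), (3.2, 3.6), (2.8, 3.4)]

X = majority + [noisy_majority] + minority
y = [0] * (len(majority) + 1) + [1] * len(minority)

X_res, y_res = smote_enn(X, y)
print("before:", y.count(0), "majority,", y.count(1), "minority")
print("after: ", y_res.count(0), "majority,", y_res.count(1), "minority")
```

On this toy data the oversampling step brings the minority class up to the size of the majority class, and the cleaning step then removes the majority point that sits inside the minority cluster. SMOTE-Tomek differs only in the cleaning step, which removes Tomek links (pairs of opposite-class points that are each other's nearest neighbour) rather than applying the k-neighbour vote.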
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a CCAL that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.



