Incidence, risk factors, and predictive modeling of pulmonary infection after high-risk surgery for lung cancer: a retrospective case-control study

Jiajia Ma; Bei Xue; Zhengmin Zhang; Liping Yao; Xiaoxin Liu

doi:10.21037/jtd-2024-2276

Original Article

Jiajia Ma¹,², Bei Xue¹, Zhengmin Zhang¹, Liping Yao¹, Xiaoxin Liu¹,²

¹Nursing Department, Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; ²Shanghai Jiao Tong University School of Nursing, Shanghai, China

Contributions: (I) Conception and design: J Ma, X Liu; (II) Administrative support: J Ma, B Xue; (III) Provision of study materials or patients: B Xue, Z Zhang; (IV) Collection and assembly of data: J Ma, L Yao; (V) Data analysis and interpretation: J Ma, X Liu; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Xiaoxin Liu, PhD. Nursing Department, Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine, No. 241, Huaihai West Road, Xuhui District, Shanghai 200030, China; Shanghai Jiao Tong University School of Nursing, Shanghai, China. Email: lxx1018@hotmail.com.

Background: The hierarchical surgical management system is one of the core medical systems. Grading operations by surgical risk, difficulty, resource consumption, and ethical risk helps ensure the quality and safety of surgery. With advances in medical technology and the continuing development of medical standards, the proportion of lung cancer patients undergoing high-risk surgery is increasing rapidly. This study aimed to explore the incidence of, risk factors for, and prediction models of pulmonary infection after high-risk surgery for lung cancer using machine learning algorithms.

Methods: This study included patients who underwent high-risk surgery for lung cancer at Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine from January 2021 to December 2023. Five machine learning algorithms, namely least absolute shrinkage and selection operator (LASSO)-assisted logistic regression (LR), artificial neural network (ANN), support vector machine (SVM), random forest (RF), and eXtreme gradient boosting (XGB), were used to explore risk factors for and prediction models of pulmonary infection after high-risk surgery for lung cancer.

Results: A total of 2,650 patients were eligible after application of the exclusion criteria, and the overall incidence of postoperative pulmonary infection was 9.66% (256/2,650). LASSO regression selected eight characteristic variables: daily smoking, history of diabetes, diffusing capacity of the lung for carbon monoxide percentage of predicted (DLCO%Pred), airway resistance percentage of predicted (Raw%Pred), maximum tumor diameter, perioperative oral nutritional supplements (ONS), postoperative urinary catheter, and pleural adhesion degree. The risk prediction models for postoperative pulmonary infection were constructed using these eight clinical features. The area under the curve (AUC) of the five models ranged from 0.893 to 0.936. The XGB model outperformed the others, with an AUC of 0.936 [95% confidence interval (CI): 0.923–0.949]. The LR model, with an AUC of 0.927 (95% CI: 0.921–0.939), was second only to the XGB model and was converted into a nomogram for model visualization.

Conclusions: A risk prediction model based on machine learning can help clinical nursing staff identify patients at high risk of pulmonary infection after high-risk surgery for lung cancer. The nomogram is expected to be an effective tool for nursing staff to manage the risk of pulmonary infection after such surgery.

Keywords: Lung cancer; high-risk surgery; pulmonary infection; risk prediction; machine learning

Submitted Dec 30, 2024. Accepted for publication Apr 09, 2025. Published online Jun 10, 2025.

Highlight box

Key findings

• Eight characteristic variables (daily smoking, history of diabetes, diffusing capacity of the lung for carbon monoxide percentage of predicted, airway resistance percentage of predicted, maximum tumor diameter, perioperative oral nutritional supplements, postoperative urinary catheter, and pleural adhesion degree) were important risk factors for postoperative pulmonary infection in lung cancer patients who underwent high-risk surgery. The eXtreme gradient boosting (XGB) algorithm performed best, and the least absolute shrinkage and selection operator-assisted logistic regression (LR) model was used to draw a nomogram for model visualization.

What is known and what is new?

• As the most common postoperative complication in lung cancer patients who undergo high-risk surgery, pulmonary infection has a high clinical and economic impact. A growing number of scholars are concerned with the prevention and control of pulmonary complications after difficult and complex surgery. However, few studies have addressed the risk of pulmonary infection specifically in patients who undergo high-risk surgery, and further research is needed.

• We used five machine learning algorithms including LR, artificial neural network, support vector machine, random forest, and XGB to construct prediction models for postoperative pulmonary infection in lung cancer patients who underwent high-risk surgery.

What is the implication, and what should change now?

• These findings suggest that clinicians should pay more attention to the eight risk factors above, formulate comprehensive intervention programs, and use these data to support prevention strategies for postoperative pulmonary infection. In subsequent research, the XGB model can be refined and validated in a larger multi-center database to improve its generalizability.

Introduction

Background

Lung cancer is the most common malignant tumor worldwide, with the highest standardized incidence and mortality rates (1). The incidence, mortality, and burden of lung cancer in China continued to increase between 2005 and 2020 (2). Over the next 30 years, lung cancer is projected to cause a global economic loss of 3.9 trillion US dollars and to impose the heaviest medical burden of all malignant tumors (3). Improving the utilization of medical resources for lung cancer has therefore become an urgent problem for medical staff. Surgery is the best treatment for patients with early-stage lung cancer, and thoracoscopic surgery is recommended by many national lung cancer treatment guidelines. Surgical grading management is one of the core medical systems, under which medical institutions divide surgical operations into different levels. High-risk surgery refers to operations that carry high risk, involve complicated or difficult procedures, consume substantial resources, or pose major ethical risks. With the continuous development of medical standards and the continuous updating of lung cancer surgery guidelines (4), the proportion of patients undergoing high-risk surgery for lung cancer is increasing. Although thoracoscopic surgery is less invasive than traditional thoracotomy, the operation involves the large thoracic vessels, and intraoperative manipulation can severely impair patients’ cardiopulmonary function, so that postoperative pulmonary complications occur in as many as 30% of patients (5). A growing number of scholars are concerned with the prevention and control of pulmonary complications after difficult and complex surgery (6). As the most common postoperative complication in lung cancer patients who undergo high-risk surgery, pulmonary infection has a high clinical and economic impact, leading to longer hospital stays, increased medical costs, increased rehospitalization rates, and increased long-term mortality risk (7). Because postoperative pulmonary infection is preventable (8), early identification of its high-risk factors and timely intervention are of great significance for reducing complications and the related cost burden, improving the utilization of medical resources, and improving the overall prognosis of lung cancer patients after surgery.

Rationale and knowledge gap

In recent years, the risk factors for postoperative pulmonary infection in lung cancer patients who undergo surgery have received extensive attention from scholars (9-11). Some studies have built risk assessment tools, but these tools are highly subjective, and the included risk factors and findings differ among studies, which limits the results. In addition, those studies targeted all lung cancer patients and are not specific to patients who undergo surgery involving difficult techniques, high risks, and high consumption of medical resources. A strategy for predicting and preventing postoperative pulmonary infection in such patients is still lacking. Artificial intelligence technologies such as machine learning algorithms have shown unique advantages in medical data processing and mining, assisting clinical personnel in the diagnosis and detection of disease complications. A risk prediction model based on machine learning algorithms is expected to become an effective tool for medical staff to assess the risk of postoperative pulmonary infection in lung cancer patients who undergo high-risk surgery.

Objective

Our goal was to use machine learning algorithms to learn from the clinical data available in the hospital information system and to develop an accurate postoperative pulmonary infection risk prediction model for patients who undergo high-risk surgery. Such a model should enable medical staff to identify at-risk patients in a timely manner and provide data support for prompt clinical decisions and targeted interventions. We present this article in accordance with the TRIPOD reporting checklist (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2024-2276/rc).

Methods

Study design and ethics

A retrospective case-control study was conducted at Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine. The study was approved by the Institutional Review Committee of Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine (No. KS23016). The study was conducted in strict accordance with the Declaration of Helsinki and its subsequent amendments. Given the retrospective nature of the study, the requirement for patients’ informed consent was waived. All patients’ personal information was encrypted to prevent disclosure.

Participants

We retrospectively collected clinical data from lung cancer patients who underwent high-risk surgery at Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine from January 2021 to December 2023. A computer-generated random number sequence was used to split the entire cohort into a training cohort and an internal validation cohort at a ratio of 7:3. The inclusion criteria were as follows: (I) patients ≥18 years of age; (II) patients diagnosed with primary lung cancer (12); (III) patients with no distant lung cancer metastasis; (IV) patients who had not undergone radiotherapy, chemotherapy or targeted drug therapy; and (V) patients who underwent high-risk surgery such as thoracoscopic lobectomy with or without pulmonary segmentectomy, thoracoscopic anatomic pulmonary segmentectomy, thoracoscopic total lung resection with or without lymph node dissection, thoracoscopic reconstruction with or without pulmonary lobectomy, thoracoscopic residual lung resection with or without lymph node dissection, etc. The exclusion criteria were as follows: (I) patients with infectious disease before surgery; (II) patients treated with immunosuppressants or glucocorticoids within 1 month before surgery; (III) patients with severe heart, liver or kidney failure; and (IV) patients with a clear diagnosis of mental illness or without normal cognitive or communication skills.
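The 7:3 random split described above can be illustrated with a minimal Python sketch. This is not the authors' code (the analyses were run in R); the function name `split_cohort`, the seed value, and the use of sequential patient IDs are assumptions made only for illustration.

```python
import random

def split_cohort(patient_ids, train_fraction=0.7, seed=2024):
    """Shuffle patient IDs reproducibly and split them into two cohorts."""
    rng = random.Random(seed)      # hypothetical seed; fixing it makes the split repeatable
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical IDs 1..2650 standing in for the eligible patients
train_ids, valid_ids = split_cohort(range(1, 2651))
print(len(train_ids), len(valid_ids))  # 1855 795
```

With 2,650 patients, a 0.7 fraction gives the 1,855 and 795 cohort sizes reported in the Results.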

Diagnosis of postoperative pulmonary infection

Pulmonary infection was diagnosed according to the Diagnostic Criteria for Nosocomial Infection issued by the Ministry of Health of China and the Guidelines for the Diagnosis and Treatment of Chinese Adult Hospital-Acquired Pneumonia and Ventilator-Associated Pneumonia (2018 Edition) (13). Patients who met any three of the following five criteria were diagnosed with pulmonary infection: (I) cough, purulent sputum or aggravation of original respiratory symptoms; (II) body temperature ≥38 ℃ with a significantly increased white blood cell count on routine blood testing; (III) dry and wet rales or signs of lung consolidation on physical examination; (IV) inflammatory changes, new or progressive infiltrates, patchy infiltrates, or interstitial changes on chest imaging; and (V) sputum culture positive for pathogenic bacteria. In this study, postoperative pulmonary infection refers to pulmonary infection occurring within 1 week after high-risk surgery for lung cancer.
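The "any three of five criteria within 1 week" rule can be written as a small check. The sketch below is illustrative only; the class `Findings`, its field names, and the function `meets_case_definition` are hypothetical and are not taken from the study's data dictionary.

```python
from dataclasses import dataclass

@dataclass
class Findings:
    respiratory_symptoms: bool      # (I) cough, purulent sputum, or worsening symptoms
    fever_with_leukocytosis: bool   # (II) temperature >= 38 C with raised white cell count
    rales_or_consolidation: bool    # (III) physical examination findings
    imaging_changes: bool           # (IV) inflammatory or infiltrative changes on imaging
    positive_sputum_culture: bool   # (V) pathogenic bacteria cultured from sputum
    days_after_surgery: int         # day on which the findings were recorded

def meets_case_definition(f, min_criteria=3, window_days=7):
    """Return True when at least three criteria are met within one week of surgery."""
    criteria_met = sum([
        f.respiratory_symptoms,
        f.fever_with_leukocytosis,
        f.rales_or_consolidation,
        f.imaging_changes,
        f.positive_sputum_culture,
    ])
    return criteria_met >= min_criteria and 0 <= f.days_after_surgery <= window_days

case = Findings(True, True, False, True, False, days_after_surgery=3)
print(meets_case_definition(case))  # True
```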

Data collection

We used the hospital’s electronic medical record archiving system to extract information on risk factors for postoperative pulmonary infection in lung cancer patients who underwent high-risk surgery. (I) Personal information: gender, age, body mass index (BMI), daily smoking, history of hypertension, and history of diabetes. (II) Preoperative pulmonary function: forced expiratory volume in 1 second/forced vital capacity (FEV1/FVC), diffusing capacity of the lung for carbon monoxide percentage of predicted (DLCO%Pred), airway resistance percentage of predicted (Raw%Pred), and forced vital capacity (FVC). (III) Surgical information: surgical site, number of intraoperative indwelling drainage tubes, intraoperative pathologic type, intraoperative hilar activity, intraoperative pleural adhesion degree, tumor infiltration degree, and maximum tumor diameter. (IV) Perioperative nursing information: activities of daily living 24 hours after surgery, critical modified early warning score (MEWS) 24 hours after surgery, properties of thoracic fluid 24 hours after surgery, thoracic fluid volume 24 hours after surgery, perioperative oral nutritional supplements (ONS), and postoperative urinary catheter.

We used the examination report inquiry system, the electronic medical record system, and the nursing record system to extract the information needed to diagnose postoperative pulmonary infection, including the patient’s white blood cell count, sputum culture results, chest imaging reports, respiratory symptom records, and temperature records.

Missing data

Variables with “near-zero variance” have little predictive value, so we removed these predictors before modeling. Among variables designated as non-essential, we excluded those with more than 5% missing data. For mandatory variables, the database prevents final validation of a case during data entry if any mandatory value is missing. The remaining missing values were imputed with the k-nearest neighbors method, in which the k most similar samples are identified and the missing value is filled in from the values of those neighbors.
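The two steps above (dropping variables with more than 5% missing values, then k-nearest neighbors imputation) can be sketched in plain Python. This is a simplified illustration, not the authors' R implementation: the function names are hypothetical, the distance is a root mean squared difference over the features both rows have observed, and the neighbor mean is used as the fill value (for categorical variables the most frequent neighbor value would be the more natural choice).

```python
import math

def missing_fraction(column):
    """Share of entries in a column that are missing (None)."""
    return sum(v is None for v in column) / len(column)

def knn_impute(rows, k=3):
    """Fill each None with the mean of that feature among the k nearest rows."""
    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(rows):
                if other_index == i or other[j] is None:
                    continue
                # Compare only the features observed in both rows
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                completed[i][j] = sum(nearest) / len(nearest)
    return completed

# Toy data (hypothetical values): the second row is missing its third feature
rows = [[1.0, 2.0, 3.0], [1.1, 2.1, None], [5.0, 6.0, 9.0], [0.9, 1.9, 2.8]]
columns = list(zip(*rows))
kept = [j for j, col in enumerate(columns) if missing_fraction(col) <= 0.05]
filled = knn_impute(rows, k=2)
print(kept, filled[1])
```

In this toy table the third column is 25% missing, so the 5% rule would drop it; the imputation call is shown on the full table only to demonstrate how a gap is filled from the two closest rows.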

Statistical analysis

Categorical variables are presented as counts (proportions), and quantitative data with a skewed distribution are expressed as the median [interquartile range (IQR)]. The Chi-squared test or Mann-Whitney U test was used to compare the training cohort with the validation cohort. We used least absolute shrinkage and selection operator (LASSO) regression to screen the 23 clinical features, retaining those with non-zero coefficients as the variables with the greatest predictive power. Five machine learning algorithms, namely LASSO-assisted logistic regression (LR), artificial neural network (ANN), support vector machine (SVM), random forest (RF), and eXtreme gradient boosting (XGB), were used to construct prediction models for postoperative pulmonary infection in lung cancer patients who underwent high-risk surgery. Optimal hyperparameters were obtained by grid search with 10-fold cross-validation, after which each model was retrained and internally validated to determine the final model. Model performance was measured with the area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1 score (F1), Brier score, and calibration curves. We used the SHapley Additive exPlanations (SHAP) method to interpret the machine learning models and to show how individual variables influence predictions. The LR model was recalibrated by adjusting the intercept and the regression coefficients using the calibration intercept and calibration slope (14), and it was then converted into a nomogram for visualization in clinical applications. The analyses were implemented in R version 4.1.
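For readers less familiar with the threshold-based metrics listed above, the following Python sketch computes them from predicted probabilities. It is an illustration under stated assumptions rather than the study's code: the function name, the 0.5 classification threshold, and the toy labels and probabilities are all hypothetical.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Confusion-matrix metrics plus the Brier score for binary predictions."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # same quantity as recall
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),           # same quantity as precision
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        # Brier score: mean squared gap between predicted probability and outcome
        "brier": sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true),
    }

# Toy example (hypothetical values): 3 infected and 5 non-infected patients
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1]
metrics = classification_metrics(y_true, y_prob)
print({name: round(value, 3) for name, value in metrics.items()})
```

A lower Brier score indicates better agreement between predicted probabilities and observed outcomes, which is why it is reported alongside the calibration curves.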

Results

Cohort comparison

A total of 2,650 eligible patients were enrolled in the study (Figure 1). The enrolled patients were randomly divided into the training cohort (n=1,855) and the validation cohort (n=795) at a ratio of 7:3. There was no significant difference in the features between the training cohort and the validation cohort (P>0.05) (Table 1). The incidence of postoperative pulmonary infection was 9.65% and 9.69% in the training and validation cohorts, respectively.

Figure 1 Patient recruitment flowchart.

Table 1

Features in the training cohort and the validation cohort

Features	Total patients (n=2,650)	Training cohort (n=1,855)	Validation cohort (n=795)	P value
Pulmonary infection				0.81
Yes	256 (9.66)	179 (9.65)	77 (9.69)
No	2,394 (90.34)	1,676 (90.35)	718 (90.31)
Gender				0.56
Male	1,096 (41.36)	758 (40.86)	338 (42.52)
Female	1,554 (58.64)	1,097 (59.14)	457 (57.48)
Age (years)				0.43
<45	436 (16.45)	303 (16.33)	133 (16.73)
45–65	1,575 (59.43)	1,111 (59.89)	464 (58.36)
>65	639 (24.11)	441 (23.77)	198 (24.91)
BMI (kg/m²)				0.51
<18.5	214 (8.08)	152 (8.19)	62 (7.80)
18.5–24	2,106 (79.47)	1,476 (79.57)	630 (79.25)
24.1–28	180 (6.79)	126 (6.79)	54 (6.79)
>28	150 (5.66)	101 (5.44)	49 (6.16)
Daily smoking				0.61
No smoking	2,247 (84.79)	1,566 (84.42)	681 (85.66)
<10 cigarettes	152 (5.74)	110 (5.93)	42 (5.28)
10–20 cigarettes	146 (5.51)	102 (5.50)	44 (5.53)
>20 cigarettes	105 (3.96)	77 (4.15)	28 (3.52)
History of hypertension				0.18
Yes	274 (10.34)	194 (10.46)	80 (10.06)
No	2,376 (89.66)	1,661 (89.54)	715 (89.94)
History of diabetes				0.53
Yes	220 (8.30)	153 (8.25)	68 (8.55)
No	2,430 (91.70)	1,702 (91.75)	727 (91.45)
FEV1/FVC				0.57
FEV1/FVC ≥80%	1,707 (64.42)	1,187 (63.99)	520 (65.41)
70%≤ FEV1/FVC <80%	549 (20.72)	381 (20.54)	168 (21.13)
50%≤ FEV1/FVC <70%	268 (10.11)	195 (10.51)	73 (9.18)
FEV1/FVC <50%	126 (4.75)	92 (4.96)	34 (4.28)
DLCO%Pred				0.53
DLCO%Pred ≥80%	2,011 (75.89)	1,407 (75.85)	604 (75.97)
60%≤ DLCO%Pred <80%	505 (19.06)	352 (18.98)	153 (19.25)
40%≤ DLCO%Pred <60%	87 (3.28)	59 (3.18)	28 (3.52)
DLCO%Pred <40%	47 (1.77)	37 (1.99)	10 (1.26)
Raw%Pred				0.22
Raw%Pred ≤120%	885 (33.40)	617 (33.26)	268 (33.71)
120%< Raw%Pred ≤140%	807 (30.45)	558 (30.08)	249 (31.32)
140%< Raw%Pred ≤160%	745 (28.11)	527 (28.41)	218 (27.42)
Raw%Pred >160%	213 (8.04)	153 (8.25)	60 (7.55)
FVC (L)				0.54
<3	504 (19.02)	361 (19.46)	143 (17.99)
3–4	1,949 (73.55)	1,355 (73.05)	594 (74.72)
>4	197 (7.43)	139 (7.49)	58 (7.30)
Surgical site				0.62
Left lung	1,035 (39.06)	725 (39.08)	310 (38.99)
Right lung	1,479 (55.81)	1,038 (55.96)	441 (55.47)
Both lungs	136 (5.13)	92 (4.96)	44 (5.53)
Number of intraoperative indwelling drainage tubes				0.54
Single	974 (36.75)	673 (36.28)	301 (37.86)
Two or more	1,676 (63.25)	1,182 (63.72)	494 (62.14)
Intraoperative pathological type				0.83
Lung squamous cell carcinoma	2,425 (91.51)	1,700 (91.64)	725 (91.19)
Lung adenocarcinoma	184 (6.94)	126 (6.79)	58 (7.30)
Lung adenosquamous carcinoma	41 (1.55)	29 (1.56)	12 (1.51)
Intraoperative hilar activity				0.71
Normal motion	1,993 (75.21)	1,403 (75.63)	590 (74.21)
Strong activity	657 (24.79)	452 (24.37)	205 (25.79)
Pleural adhesion degree				0.69
Non-adhesion	2,289 (86.38)	1,600 (86.25)	689 (86.67)
Minor adhesion	178 (6.72)	126 (6.79)	52 (6.54)
Moderate adhesion	106 (4.00)	71 (3.83)	35 (4.40)
Extensive adhesion	77 (2.91)	58 (3.13)	19 (2.39)
Degree of tumor invasion				0.48
Invasive carcinoma	395 (14.91)	282 (15.20)	113 (14.21)
Microinvasive carcinoma	742 (28.00)	516 (27.82)	226 (28.43)
Carcinoma in situ	1,513 (57.09)	1,057 (56.98)	456 (57.36)
Maximum tumor diameter (cm)				0.33
<1	682 (25.74)	486 (26.20)	196 (24.65)
1–2	1,003 (37.85)	696 (37.52)	307 (38.62)
>2	965 (36.42)	673 (36.28)	292 (36.73)
Postoperative activities of daily living				0.50
Mild dependence	120 (4.53)	82 (4.42)	38 (4.78)
Moderate dependence	338 (12.75)	237 (12.78)	101 (12.70)
Severe dependence	2,192 (82.72)	1,536 (82.80)	656 (82.52)
Postoperative critical MEWS (points)				0.45
<4	2,117 (79.89)	1,488 (80.22)	629 (79.12)
4–5	511 (19.28)	353 (19.03)	158 (19.87)
>5	22 (0.83)	14 (0.75)	8 (1.01)
Properties of thoracic fluid 24 hours after surgery				0.71
Serous	88 (3.32)	66 (3.56)	22 (2.77)
Light bloody	1,810 (68.30)	1,264 (68.14)	546 (68.68)
Bloody	752 (28.38)	525 (28.30)	227 (28.55)
Thoracic fluid volume 24 hours after surgery (mL)				0.26
<100	221 (8.34)	152 (8.19)	69 (8.68)
100–300	1,682 (63.47)	1,174 (63.29)	508 (63.90)
301–500	670 (25.28)	476 (25.66)	195 (24.53)
>500	77 (2.91)	53 (2.86)	23 (2.89)
Perioperative ONS supplement				0.35
Yes	2,158 (81.43)	1,503 (81.02)	655 (82.39)
No	492 (18.57)	352 (18.98)	140 (17.61)
Postoperative urinary catheter				0.17
Yes	2,445 (92.26)	1,715 (92.45)	731 (91.95)
No	205 (7.74)	140 (7.55)	64 (8.05)

Data are presented as n (%). BMI, body mass index; DLCO%Pred, diffusing capacity of the lung for carbon monoxide percentage of predicted; FEV1/FVC, forced expiratory volume in 1 second/forced vital capacity; FVC, forced vital capacity; MEWS, modified early warning score; ONS, oral nutritional supplements; Raw%Pred, airway resistance percentage of predicted.

Screening of risk factors

The choice of the parameter λ is crucial in LASSO modeling and variable screening, as it directly determines how strongly the coefficients are shrunk. We used LASSO regression with 10-fold cross-validation to screen the 23 variables and retained the features with non-zero coefficients. The value λ=0.017, the largest λ whose cross-validated mean squared error lay within one standard error of the minimum, gave good model performance with the smallest number of variables, so we chose it as the optimal value for feature screening. Finally, eight characteristic variables, including daily smoking, history of diabetes, DLCO%Pred, Raw%Pred, maximum tumor diameter, perioperative ONS supplement, postoperative urinary catheter, and pleural adhesion degree, were selected for model construction (Figure 2). The LASSO regression results for the important variables related to pulmonary infection after high-risk surgery for lung cancer are shown in Table 2.
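The one-standard-error rule for choosing λ can be sketched as follows. This is a generic illustration of the rule, not the authors' R code; the function name and the cross-validation error values are hypothetical, chosen only so that the example returns 0.017.

```python
def select_lambda(lambdas, cv_mean, cv_se):
    """Return (lambda at minimum CV error, largest lambda within one SE of it)."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    limit = cv_mean[best] + cv_se[best]
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= limit]
    return lambdas[best], max(eligible)

# Hypothetical cross-validation results over a small grid of penalties
lambdas = [0.001, 0.005, 0.010, 0.017, 0.030]
cv_mean = [0.0820, 0.0805, 0.0800, 0.0808, 0.0850]   # mean squared error per lambda
cv_se = [0.0010, 0.0010, 0.0010, 0.0010, 0.0010]     # standard error per lambda

lambda_min, lambda_1se = select_lambda(lambdas, cv_mean, cv_se)
print(lambda_min, lambda_1se)  # 0.01 0.017
```

Choosing the larger penalty trades a small, statistically negligible loss in cross-validated error for a sparser model, which is the reasoning behind preferring fewer variables.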

Figure 2 LASSO regression analysis of postoperative pulmonary infection in lung cancer patients undergoing high-risk surgery. (A) Variation of the hyperparameter λ and mean squared error in LASSO regression. (B) The coefficient profiles of clinical features. LASSO, least absolute shrinkage and selection operator.

Table 2

LASSO regression results of important variables related to pulmonary infection after high-risk surgery for lung cancer in the training cohort

Variables	Coefficient	Lambda.min
Daily smoking	0.131726	0.017
History of diabetes	0.098115
DLCO%Pred	0.068246
Raw%Pred	0.147686
Maximum tumor diameter	0.053361
Perioperative ONS supplement	0.032676
Postoperative urinary catheter	0.021583
Pleural adhesion degree	0.045218

DLCO%Pred, diffusing capacity of the lung for carbon monoxide percentage of predicted; LASSO, least absolute shrinkage and selection operator; lambda.min, minimum value of lambda; ONS, oral nutritional supplements; Raw%Pred, airway resistance percentage of predicted.

Construction of prediction model based on five machine learning algorithms

The performance of the five models is summarized in Table 3. In the training cohort, the AUCs of the five models ranged from 0.893 to 0.936, and the XGB model outperformed the others. Figure 3 shows the ROC curves (Figure 3A,3B) and the calibration curves (Figure 3C,3D) of the five models in the training and validation cohorts. In the XGB model, the five most important variables were daily smoking, history of diabetes, postoperative urinary catheter, DLCO%Pred, and perioperative ONS supplement. In addition, we tested five kernels for the SVM model: linear, polynomial, Gaussian radial basis, sigmoid, and Laplacian. The Gaussian radial basis kernel gave the smallest error. For the ANN model, the lowest error rate was found with 64 neurons in the hidden layer. For the RF model, the cross-validation error was minimal with eight variables and stabilized after 100 trees.
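As a reminder of what the AUC measures, it equals the probability that a randomly chosen infected patient receives a higher predicted risk than a randomly chosen non-infected patient, with ties counted as one half. The Python sketch below computes it by direct pairwise comparison; the function name and the toy data are hypothetical, and this is not the routine used in the study.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example (hypothetical values): 3 infected and 5 non-infected patients
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1]
auc = auc_by_pairs(y_true, scores)
print(round(auc, 4))  # 0.9333
```

Here 14 of the 15 positive-negative pairs are ordered correctly, giving 14/15.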

Table 3

Performance of the five models in the training cohort and the validation cohort

Models	AUC (95% CI)	Accuracy (95% CI)	Sensitivity	Specificity	PPV	NPV	Precision	Recall	F1	Brier
Training cohort
LR	0.927 (0.921–0.939)	0.902 (0.885–0.917)	0.903	0.837	0.862	0.743	0.883	0.859	0.832	0.097
ANN	0.924 (0.901–0.941)	0.887 (0.874–0.901)	0.906	0.817	0.877	0.664	0.886	0.867	0.775	0.113
SVM	0.893 (0.882–0.906)	0.873 (0.865–0.892)	0.917	0.869	0.865	0.706	0.905	0.894	0.835	0.109
RF	0.916 (0.902–0.927)	0.882 (0.863–0.889)	0.922	0.862	0.846	0.686	0.901	0.881	0.875	0.086
XGB	0.936 (0.923–0.949)	0.903 (0.876–0.917)	0.931	0.887	0.885	0.758	0.909	0.897	0.902	0.073
Validation cohort
LR	0.907 (0.892–0.927)	0.889 (0.873–0.912)	0.907	0.816	0.847	0.747	0.871	0.865	0.801	0.114
ANN	0.902 (0.883–0.915)	0.881 (0.872–0.901)	0.895	0.822	0.825	0.689	0.825	0.852	0.768	0.121
SVM	0.895 (0.884–0.907)	0.858 (0.842–0.867)	0.881	0.815	0.798	0.722	0.855	0.841	0.770	0.096
RF	0.911 (0.891–0.926)	0.882 (0.866–0.901)	0.889	0.788	0.793	0.691	0.806	0.843	0.762	0.103
XGB	0.916 (0.901–0.933)	0.895 (0.884–0.915)	0.912	0.824	0.879	0.753	0.895	0.868	0.806	0.084

ANN, artificial neural network; AUC, area under the curve; CI, confidence interval; F1, F1 score; LR, least absolute shrinkage and selection operator-assisted logistic regression; NPV, negative predictive value; PPV, positive predictive value; RF, random forest; SVM, support vector machine; XGB, eXtreme gradient boosting.

Figure 3 ROC curve and calibration plots of different machine learning models in the training cohort and validation cohort. (A,C) The training cohort. (B,D) The validation cohort. ANN, artificial neural network; LR, least absolute shrinkage and selection operator-assisted logistic regression; RF, random forest; ROC, receiver operating characteristic; SVM, support vector machine; XGB, eXtreme gradient boosting.

The XGB model explained via the SHAP method

In our study, the XGB model achieved the highest AUC. We used the SHAP method to assess the relevance of every predictor variable. A high SHAP value has a positive impact on the output of the model, while a low SHAP value has the opposite effect. All eight variables were included in this analysis. Figure 4A is a SHAP summary plot showing the feature importance ranking of the XGB model. Each point on the summary plot is the Shapley value of one feature for one patient, with its position on the Y-axis determined by the feature importance and its position on the X-axis determined by the Shapley value. Overlapping points are jittered along the Y-axis, so the distribution of Shapley values for each feature can be seen. The horizontal position shows how strongly the value raised or lowered the prediction: red dots indicate high risk, and blue dots indicate low risk. Moreover, Figure 4B shows the importance of each predictor variable for the predictions of the XGB model, listing the variables in descending order of importance.
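The idea behind a Shapley value can be made concrete with a small exact computation. The sketch below is not the SHAP library and not the study's XGB model: the toy risk function, its coefficients, and the all-zero baseline are hypothetical, and features absent from a coalition are simply set to their baseline value. It enumerates every coalition, which is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values of each feature for one instance against a baseline."""
    n = len(instance)

    def value(subset):
        # Features in the coalition take the instance value, the rest the baseline
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(x)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        contributions.append(total)
    return contributions

def toy_risk(x):
    """Hypothetical risk score with an interaction between smoking and diabetes."""
    smoking, diabetes, catheter = x
    return 0.05 + 0.10 * smoking + 0.08 * diabetes + 0.04 * smoking * diabetes + 0.02 * catheter

phi = shapley_values(toy_risk, instance=[1, 1, 1], baseline=[0, 0, 0])
print([round(v, 3) for v in phi])  # [0.12, 0.1, 0.02]
```

The three contributions add up to the difference between the prediction for the patient (0.29) and the baseline prediction (0.05), and the interaction term is shared equally between the two features involved.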

Figure 4 SHAP of the XGB model. (A) Each characteristic attribute in SHAP. (B) The SHAP-based feature importance ranking. DLCO%Pred, diffusing capacity of the lung for carbon monoxide percentage of predicted; ONS, oral nutritional supplements; Raw%Pred, airway resistance percentage of predicted; SHAP, SHapley Additive exPlanations; XGB, eXtreme gradient boosting.

Establishment of postoperative pulmonary infection nomogram model for lung cancer patients who underwent high-risk surgery

Most risk prediction models based on machine learning algorithms are difficult to visualize, and in medical units with imperfect information systems, where clinical data cannot be extracted quickly, they are inconvenient for clinical staff to use directly. In this study, the LR model performed well, with an AUC of 0.927 [95% confidence interval (CI): 0.921–0.939], second only to the XGB model. Therefore, based on the LR results for the training cohort (Table 4), a nomogram for the risk of postoperative pulmonary infection in lung cancer patients who underwent high-risk surgery was established to facilitate its use in different clinical situations (Figure 5). In the nomogram, the length of each line segment reflects the contribution of that variable to the outcome event. “Points” represents the score assigned to each variable at a given value, and “total points” represents the sum of the scores across all variables. Finally, the total points are converted, through the functional relationship between total points and the probability of the outcome event, into the corresponding probability of postoperative pulmonary infection.
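The conversion from variable values to points, and from total points to a probability, can be sketched for a logistic model as follows. This is a simplified illustration and not the published nomogram: only two variables are used, their coefficients are taken as the natural logarithm of the adjusted ORs in Table 4 under the assumption that each OR applies per one-level increase, the level coding is assumed, the intercept is hypothetical because it is not reported, and the scaling shown assumes positive coefficients.

```python
import math

# Log-odds per one-level increase, taken as ln(adjusted OR) from Table 4 (assumption)
COEFS = {"daily_smoking": math.log(2.934), "diabetes": math.log(2.629)}
# Number of steps above the reference level (assumed coding: smoking 0-3, diabetes 0-1)
RANGES = {"daily_smoking": 3, "diabetes": 1}
INTERCEPT = -3.0  # hypothetical value; the source does not report the intercept

def points_per_unit(coefs, ranges):
    """Scale coefficients so the variable with the widest effect spans 0 to 100 points."""
    max_effect = max(coefs[name] * ranges[name] for name in coefs)
    scale = {name: 100.0 * coefs[name] / max_effect for name in coefs}
    return scale, max_effect

def nomogram_prediction(levels, coefs, ranges, intercept):
    """Return (total points, predicted probability) for one patient."""
    scale, max_effect = points_per_unit(coefs, ranges)
    total_points = sum(scale[name] * levels[name] for name in levels)
    # Map total points back to the log-odds scale, then apply the logistic function
    linear_predictor = intercept + total_points * max_effect / 100.0
    return total_points, 1.0 / (1.0 + math.exp(-linear_predictor))

patient = {"daily_smoking": 2, "diabetes": 1}
points, risk = nomogram_prediction(patient, COEFS, RANGES, INTERCEPT)
print(round(points, 1), round(risk, 3))
```

The points scale is only a linear rescaling of the log-odds, so reading a probability from the nomogram gives the same result as evaluating the logistic regression equation directly.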

Table 4

Multivariate logistic regression analysis results in the training cohort

Variables	Adjusted OR (95% CI)	P
Daily smoking	2.934 (1.976–4.255)	<0.001
History of diabetes	2.629 (1.826–3.563)	<0.001
DLCO%Pred	1.913 (1.308–2.927)	<0.001
Maximum tumor diameter	0.851 (0.735–0.917)	0.004
Raw%Pred	0.102 (0.853–0.126)	0.002
Perioperative ONS supplement	1.135 (0.863–1.154)	<0.001
Postoperative urinary catheter	1.532 (1.176–1.724)	<0.001
Pleural adhesion degree	0.824 (0.719–0.905)	0.004

CI, confidence interval; DLCO%Pred, diffusing capacity of the lung for carbon monoxide percentage of predicted; ONS, oral nutritional supplements; OR, odds ratio; Raw%Pred, airway resistance percentage of predicted.

Figure 5 Nomogram for estimating postoperative pulmonary infection in lung cancer patients who underwent high-risk surgery. DLCO%Pred, diffusing capacity of the lung for carbon monoxide percentage of predicted; ONS, oral nutritional supplements; Raw%Pred, airway resistance percentage of predicted.

Discussion

Artificial intelligence has made notable progress in its applications in the medical field. Machine learning is an important branch of artificial intelligence that can process cumbersome clinical data, learn independently, and make accurate predictions (15). The traditional scoring model relies on scoring parameters through logistic regression analysis, which has the advantage of strong interpretability, but the technique fails to handle complex interactions between variables and is prone to underfitting. Machine learning classification can avoid these shortcomings (16). At present, few studies based on machine learning algorithms have systematically explored the risk factors for pulmonary infection specifically in patients who undergo high-risk surgery. Risk prediction models based on machine learning algorithms are of great significance for rapidly and accurately identifying high-risk factors, taking targeted preventive measures in time, and reducing the occurrence of postoperative pulmonary infection.

Research confirms that machine learning algorithms have become a common aid for patient risk management in clinical care (17-19). Machine learning has begun to be applied to the risk management of postoperative pulmonary infection in surgical patients, including risk prediction in patients undergoing complex surgical operations (20-23). These studies have shown that, with the help of machine learning algorithms, nurses can take more accurate preventive measures against postoperative complications in their patients. This enables early detection and early treatment, reducing healthcare costs and improving outcomes without increasing the burden on health care professionals. Therefore, this study used five machine learning algorithms to learn from various sources of patients’ clinical data, such as the hospital information system, bedside monitors, and nursing records. We used 2,650 clinical data records to develop and internally validate a postoperative pulmonary infection risk prediction model for lung cancer patients who underwent high-risk surgery.

In our study, the AUC of the five models ranged from 0.893 to 0.936. Sensitivity ranged from 0.903 to 0.931 and PPV from 0.846 to 0.885, indicating that the models were well targeted at patients at high risk of postoperative pulmonary infection. Specificity ranged from 0.817 to 0.887, indicating that the models could better distinguish postoperative pulmonary infection from other similar conditions and reduce the incidence of misdiagnosis. The overall incidence of postoperative pulmonary infection in lung cancer patients who underwent high-risk surgery was 9.50%, and these patients faced negative effects such as prolonged hospital stays, increased medical costs, and an increased risk of death. Therefore, the aim of this study was to use machine learning algorithms to capture as many potential cases of postoperative pulmonary infection as possible and to carry out targeted interventions on patients’ high-risk factors, thus minimizing the possibility of postoperative pulmonary infection. Higher daily smoking, history of diabetes, worse preoperative lung diffusion capacity, higher preoperative airway resistance, larger maximum tumor diameter, no perioperative ONS supplement, postoperative urinary catheter, and higher pleural adhesion degree were the risk factors for postoperative pulmonary infection in lung cancer patients who underwent high-risk surgery, suggesting that these aspects deserve attention in clinical work.
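For readers less familiar with the performance measures reported in the preceding paragraph, the sketch below shows how sensitivity, specificity, and PPV follow from confusion-matrix counts, and how AUC can be read as the probability that a randomly chosen infected patient receives a higher predicted risk than a randomly chosen non-infected patient. The counts and scores are hypothetical and are not taken from the study; the function names are illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # infected patients correctly flagged
        "specificity": tn / (tn + fp),  # non-infected patients correctly cleared
        "ppv": tp / (tp + fp),          # flagged patients who are truly infected
        "npv": tn / (tn + fn),          # cleared patients who are truly non-infected
    }


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half (the Mann-Whitney formulation)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts and predicted risks, for illustration only
metrics = classification_metrics(tp=90, fp=15, tn=85, fn=10)
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Note that PPV, unlike sensitivity and specificity, depends on the proportion of infected patients in the evaluated sample, so it should be interpreted together with the sampling design.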

These five models have the potential to be effective tools to assist in the clinical management of pulmonary infections in lung cancer patients who underwent high-risk surgery, and their capabilities should improve as clinical data accumulate and more input features are added. XGB is widely recognized in many machine learning and data mining challenges (24,25). Notably, during internal validation, the XGB model showed the best predictive performance: it outperformed the other four models and had good discrimination and calibration. The main reason for this outcome is that the XGB algorithm has superior learning and generalization ability; it can handle large datasets and high-dimensional features effectively and optimizes model performance through gradient boosting, providing accurate predictions with a small generalization error. It also has unique advantages in dealing with complex interactions and nonlinear relations between variables. One important implication of this study is that the established prediction model could be integrated into the electronic systems of medical institutions, in line with the future development trend of the medical industry. When a patient’s medical and nursing data are uploaded to the system, the prediction model would be applied synchronously to provide automatic, real-time, efficient, and intelligent early warning of pulmonary infection after high-risk surgery for lung cancer, assisting clinicians in early assessment and judgment of the patient’s status.
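To illustrate the gradient boosting idea referred to above, the following is a minimal sketch of first-order boosting with decision stumps for a binary outcome. It is not the XGB implementation used in this study: XGBoost additionally uses second-order gradient information and regularization. The feature coding, toy data, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the single-feature threshold split that best fits the residuals
    in the least-squares sense. Returns None if no valid split exists."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return None if best is None else best[1:]


def predict_stump(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_gbm(X, y, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the negative gradient of the log-loss
    (observed label minus predicted probability) and adds it to the model."""
    p0 = sum(y) / len(y)
    f0 = math.log(p0 / (1.0 - p0))  # initial log-odds from the outcome rate
    scores = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(fi) for yi, fi in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        scores = [fi + learning_rate * predict_stump(stump, row)
                  for fi, row in zip(scores, X)]
    return f0, learning_rate, stumps


def predict_proba(model, row):
    f0, learning_rate, stumps = model
    return sigmoid(f0 + learning_rate * sum(predict_stump(s, row) for s in stumps))


# Hypothetical toy data: columns are two binary risk factors, y is the outcome
X = [[0, 0], [0, 0], [0, 1], [1, 0], [1, 1], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 0, 1, 1, 1, 0, 1]
model = fit_gbm(X, y)
```

Each added stump corrects part of the error left by the previous ones, which is why boosted ensembles can capture interactions and nonlinear relations that a single linear model cannot.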

However, further development is needed to translate machine learning classifiers into clinical or other real-world settings. For example, some medical units do not have a complete clinical data management system, and some patient data may not be directly obtainable from the system. To improve the operability, generalizability, and clinical uptake of the risk prediction model, this study converted the well-performing LR model into a nomogram. We hope that the nomogram can compensate for the lack of visual interpretability of machine learning models and help medical staff make initial judgments about patients in limited situations, such as inadequate medical conditions. In addition, the nomogram is convenient and timely: medical staff can use it to quickly identify patients at high risk of postoperative pulmonary infection under various medical conditions (26,27), providing data support for the accurate formulation of targeted prevention strategies for postoperative pulmonary infection.
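The conversion of a logistic regression model into a nomogram, as described above, can be sketched as follows. Each predictor's contribution to the log-odds is rescaled so that the predictor with the largest possible effect spans 0 to 100 points, and the total points map back to a predicted probability. The coefficients, intercept, predictor names, and ranges below are hypothetical and are not the fitted values behind Figure 5.

```python
import math


def nomogram_scale(coefs, ranges):
    """Points per unit of log-odds, chosen so that the predictor with the
    largest effect span receives 100 points at its extreme value."""
    spans = [abs(coefs[k]) * (ranges[k][1] - ranges[k][0]) for k in coefs]
    return 100.0 / max(spans)


def reference_value(beta, value_range):
    """Value of a predictor that earns 0 points (its lowest-risk end)."""
    low, high = value_range
    return low if beta >= 0 else high


def points_for(name, value, coefs, ranges, scale):
    beta = coefs[name]
    return beta * (value - reference_value(beta, ranges[name])) * scale


def probability_from_points(total_points, intercept, coefs, ranges, scale):
    """Map total points back to the predicted probability of the outcome."""
    baseline = intercept + sum(
        beta * reference_value(beta, ranges[k]) for k, beta in coefs.items()
    )
    return 1.0 / (1.0 + math.exp(-(baseline + total_points / scale)))


# Hypothetical model, for illustration only
coefs = {"binary_factor": 1.0, "continuous_factor": -0.5, "ordinal_factor": 0.4}
ranges = {"binary_factor": (0, 1), "continuous_factor": (0.0, 4.0), "ordinal_factor": (0, 3)}
intercept = -2.0

patient = {"binary_factor": 1, "continuous_factor": 1.0, "ordinal_factor": 2}
scale = nomogram_scale(coefs, ranges)
total = sum(points_for(k, v, coefs, ranges, scale) for k, v in patient.items())
risk = probability_from_points(total, intercept, coefs, ranges, scale)
```

The point total is a linear rescaling of the model's log-odds, so a paper nomogram gives the same predicted probability as the underlying regression equation, up to reading precision.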

Conclusions

In this study, we established five machine learning-based risk prediction models and a risk prediction nomogram for lung cancer patients who underwent high-risk surgery, which can efficiently help medical staff predict the probability of postoperative pulmonary infection. Our study has some limitations. First, the data came from a single large academic medical center, so model performance at other medical institutions is unclear, and we lacked multi-center external data for model validation. In the future, we hope to collect sufficient independent external datasets from cooperating institutions at different times and places, so as to fully test the model through temporal and geographical validation. Second, the study does not account for temporal changes in risk factors after surgery. In the future, we will use recurrent neural networks (RNN) or long short-term memory (LSTM) networks to integrate time-series models, optimize the models, and adjust parameters to further improve prediction accuracy. Third, whether family participation information, biomarkers, and other potential influencing factors affect the risk of postoperative pulmonary infection in lung cancer patients who underwent high-risk surgery remains to be explored. Finally, some experts have pointed out that predictive performance can be further improved by adopting ensemble modeling (28). In subsequent studies, we will combine machine learning algorithms to optimize the model and pursue its integration into the hospital’s electronic system. We hope to focus on these issues in future research.

Acknowledgments

The authors thank all participants of the present study as well as all members of staff of the study for their role in data collection.

Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2024-2276/rc

Data Sharing Statement: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2024-2276/dss

Peer Review File: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2024-2276/prf

Funding: This work was supported by the Shanghai Jiao Tong University School of Medicine: Nursing Development Program, Shanghai 2024 “Science and Technology Innovation Action Plan” Science Popularization Special Project (No. 24DZ2300700), the 2024 Shanghai Health System Young Talent Award Foundation’s First Jahwa-Nursing Special Technology Support Project, and the 2024 Shanghai Hospital Development Center Municipal Hospital Diagnosis and Treatment Technology Promotion and Optimization Management Project (No. SHDC22024225).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2024-2276/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Review Committee of Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine (No. KS23016). Given the retrospective nature of the study, the requirement for patient informed consent was waived.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Filho AM, Laversanne M, Ferlay J, et al. The GLOBOCAN 2022 cancer estimates: Data sources, methods, and a snapshot of the cancer burden worldwide. Int J Cancer 2025;156:1336-46.
2. Qi J, Li M, Wang L, et al. National and subnational trends in cancer burden in China, 2005-20: an analysis of national mortality surveillance data. Lancet Public Health 2023;8:e943-55.
3. Chen S, Cao Z, Prettner K, et al. Estimates and Projections of the Global Economic Cost of 29 Cancers in 204 Countries and Territories From 2020 to 2050. JAMA Oncol 2023;9:465-72.
4. Riely GJ, Wood DE, Ettinger DS, et al. Non-Small Cell Lung Cancer, Version 4.2024, NCCN Clinical Practice Guidelines in Oncology. J Natl Compr Canc Netw 2024;22:249-74.
5. Ma S, Li F, Li J, et al. Risk factor analysis and nomogram prediction model construction of postoperative complications of thoracoscopic non-small cell lung cancer. J Thorac Dis 2024;16:3655-67.
6. Karalapillai D, Weinberg L, Peyton P, et al. Effect of Intraoperative Low Tidal Volume vs Conventional Tidal Volume on Postoperative Pulmonary Complications in Patients Undergoing Major Surgery: A Randomized Clinical Trial. JAMA 2020;324:848-58.
7. Wang Y, Li J, Wu Q, et al. Pathogen distribution in pulmonary infection in chinese patients with lung cancer: a systematic review and meta-analysis. BMC Pulm Med 2023;23:402.
8. Semenkovich TR, Frederiksen C, Hudson JL, et al. Postoperative Pneumonia Prevention in Pulmonary Resections: A Feasibility Pilot Study. Ann Thorac Surg 2019;107:262-70.
9. Zhang C, Fu Y, Chen Q, et al. Risk factors for postoperative pulmonary infections in non-small cell lung cancer: a regression-based nomogram prediction model. Am J Cancer Res 2024;14:5365-77.
10. Zhu H, Lu H. Effects of bronchoscopic alveolar lavage-assisted mechanical ventilation on postoperative pulmonary infection and inflammatory factors in patients undergoing lung cancer surgery. Wideochir Inne Tech Maloinwazyjne 2024;19:347-55.
11. Ding Z, Wang X, Jiang S, et al. Risk factors for postoperative pulmonary infection in patients with non-small cell lung cancer: analysis based on regression models and construction of a nomogram prediction model. Am J Transl Res 2023;15:3375-84.
12. Wood DE, Kazerooni EA, Aberle D, et al. NCCN Guidelines® Insights: Lung Cancer Screening, Version 1.2022. J Natl Compr Canc Netw 2022;20:754-64.
13. Shi Y, Huang Y, Zhang TT, et al. Chinese guidelines for the diagnosis and treatment of hospital-acquired pneumonia and ventilator-associated pneumonia in adults (2018 Edition). J Thorac Dis 2019;11:2581-616.
14. Janssen KJ, Moons KG, Kalkman CJ, et al. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol 2008;61:76-86.
15. Leeming J. How AI is helping the natural sciences. Nature 2021;598:5-7.
16. Cecen B, Topates G, Kara A, et al. Biocompatibility of silicon nitride produced via partial sintering & tape casting. Ceram Int 2021;47:3938-45.
17. Shelley B, Shaw M. Machine learning and preoperative risk prediction: the machines are coming. Br J Anaesth 2024;133:925-30.
18. Wang J, Chen H, Wang H, et al. A Risk Prediction Model for Physical Restraints Among Older Chinese Adults in Long-term Care Facilities: Machine Learning Study. J Med Internet Res 2023;25:e43815.
19. Zheng Y, Zhang C, Liu Y. Risk prediction models of depression in older adults with chronic diseases. J Affect Disord 2024;359:182-8.
20. Li MP, Liu WC, Wu JB, et al. Machine learning for the prediction of postoperative nosocomial pulmonary infection in patients with spinal cord injury. Eur Spine J 2023;32:3825-35.
21. Lu C, Xing ZX, Xia XG, et al. Development and validation of a postoperative pulmonary infection prediction model for patients with primary hepatic carcinoma. World J Gastrointest Oncol 2023;15:1241-52.
22. Liu J, Li X, Wang Y, et al. Predicting postoperative pulmonary infection in elderly patients undergoing major surgery: a study based on logistic regression and machine learning models. BMC Pulm Med 2025;25:128.
23. Wu X, Zhang H, Cai M, et al. Predicting the risk of pulmonary infection after kidney transplantation using machine learning methods: a retrospective cohort study. Int Urol Nephrol 2025;57:947-55.
24. Kumar H, Agarwal R, Yadav A. Bio-inspired gloden jackal optimization of XGBoost model enhances 30-day sepsis mortality predictions. J Crit Care 2025;87:155013.
25. Zheng L, Xue YJ, Yuan ZN, et al. Explainable SHAP-XGBoost models for pressure injuries among patients requiring with mechanical ventilation in intensive care unit. Sci Rep 2025;15:9878.
26. Wang P, Fang E, Zhao X, et al. Nomogram for soiling prediction in postsurgery hirschsprung children: a retrospective study. Int J Surg 2024;110:1627-36.
27. Fang F, Liu T, Li J, et al. A novel nomogram for predicting the prolonged length of stay in post-anesthesia care unit after elective operation. BMC Anesthesiol 2023;23:404.
28. Zhang Z, Chen L, Xu P, et al. Predictive analytics with ensemble modeling in laparoscopic surgery: a technical note. Laparosc Endosc Robot Surg 2022;5:25-34.

Cite this article as: Ma J, Xue B, Zhang Z, Yao L, Liu X. Incidence, risk factors, and predictive modeling of pulmonary infection after high-risk surgery for lung cancer: a retrospective case-control study. J Thorac Dis 2025;17(6):3702-3715. doi: 10.21037/jtd-2024-2276
