Fusion of 2.5D deep transfer learning and radiomics for predicting benign and malignant Lung Imaging Reporting and Data System (Lung-RADS) 3 and 4A nodules
Original Article

Kui Wang1,2, Xiaoxiao Huang1,3, Xiaoxin Huang1,4, Xingyu Mu1,5, Lijuan Liu1, Guanqiao Jin1

1Department of Radiology, Guangxi Medical University Cancer Hospital, Nanning, China; 2Life Sciences Research Institute, Guangxi Medical University, Nanning, China; 3Department of Radiology, Affiliated Hospital of Youjiang Medical University for Nationalities, Baise, China; 4Department of Radiology, Jiangbin Hospital of Guangxi Zhuang Autonomous Region, Nanning, China; 5Department of Nuclear Medicine, Affiliated Hospital of Guilin Medical University, Guilin, China

Contributions: (I) Conception and design: G Jin; (II) Administrative support: K Wang, G Jin; (III) Provision of study materials or patients: K Wang, X Huang, X Huang; (IV) Collection and assembly of data: K Wang, L Liu; (V) Data analysis and interpretation: K Wang, X Mu; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Guanqiao Jin, MD, PhD. Department of Radiology, Guangxi Medical University Cancer Hospital, 50 Liangyu Avenue, Liangqing District, Nanning 530021, China. Email: jinguanqiao77@gxmu.edu.cn.

Background: Pulmonary nodules categorized as Lung Imaging Reporting and Data System (Lung-RADS) 3 and 4A constitute a substantial proportion of radiologically indeterminate lesions, and current imaging criteria demonstrate limited accuracy for distinguishing benign from malignant nodules. This study aimed to develop and validate an integrated predictive model combining radiomics and two-and-a-half-dimensional (2.5D) deep transfer learning (DTL) for differentiating benign from malignant pulmonary nodules classified as Lung-RADS categories 3 and 4A on computed tomography (CT).

Methods: This retrospective study included 298 patients with Lung-RADS 3 and 4A nodules from three centers. The cohort from Center 1 (n=247) was divided into training (n=172) and test (n=75) sets, while patients from Centers 2 and 3 (n=51) formed an independent validation set. We constructed three models: a radiomics (Rad) model, a DTL model, and an integrated deep transfer radiomics (DTR) model. Model performance was evaluated using the area under the receiver operating characteristic curve, and clinical utility was assessed using decision curve analysis.

Results: The DTR model demonstrated superior performance in the training [area under the curve (AUC): 0.975, 95% confidence interval (CI): 0.9529–0.9981], testing (AUC: 0.851, 95% CI: 0.7328–0.9683), and external validation cohorts (AUC: 0.727, 95% CI: 0.5876–0.8663), outperforming both the Rad model (training: AUC: 0.743; testing: AUC: 0.642; validation: AUC: 0.613) and the DTL model (training: AUC: 0.843; testing: AUC: 0.757; validation: AUC: 0.701). The DTR model exhibited high specificity (0.895) and positive predictive value (0.956) in the test cohort. SHapley Additive exPlanations (SHAP) analysis revealed that the DTR model effectively leveraged complementary features from both Rad and DTL.

Conclusions: The integration of Rad and 2.5D DTL significantly improves the diagnostic accuracy for differentiating between benign and malignant Lung-RADS 3 and 4A nodules. This approach provides a robust decision-support tool that could potentially reduce unnecessary interventions for benign nodules while facilitating earlier detection and treatment of malignant lesions.

Keywords: Pulmonary nodules; radiomics (Rad); deep transfer learning (DTL); 2.5D model; malignancy potential prediction


Submitted Mar 19, 2025. Accepted for publication Jul 04, 2025. Published online Oct 29, 2025.

doi: 10.21037/jtd-2025-584


Highlight box

Key findings

• By integrating radiomic features with 2.5D deep transfer learning features, the diagnostic accuracy for distinguishing benign and malignant pulmonary nodules in Lung Imaging Reporting and Data System (Lung-RADS) categories 3 and 4A was significantly enhanced.

What is known and what is new?

• Radiomics and deep learning enhance pulmonary nodule image analysis by extracting quantifiable features beyond human visual assessment.

• The Lung-RADS classification system standardizes pulmonary nodule diagnosis and guides clinical management, particularly for indeterminate category 3/4A nodules where multi-feature fusion methods demonstrate superior diagnostic performance, outperforming conventional models in multi-center validations.

What is the implication, and what should change now?

• This study developed a novel model specifically designed to differentiate between benign and malignant pulmonary nodules in Lung-RADS categories 3 and 4A, aiming to enhance the quality of clinical decision-making. By integrating advanced technologies into the current clinical workflow, the model seeks to improve diagnostic accuracy and reliability.


Introduction

Lung cancer is the most prevalent malignant neoplasm worldwide and remains the leading cause of cancer-related mortality globally (1,2). This high mortality rate is primarily attributed to delayed diagnosis, as the majority of patients present with advanced-stage disease, where therapeutic options are limited. Consequently, the five-year survival rate remains relatively low at approximately 19.7% (3). In contrast, patients diagnosed with early-stage lung cancer who undergo curative surgical intervention exhibit significantly improved survival outcomes compared to those with advanced-stage disease (4).

However, early detection of lung cancer continues to pose a significant clinical challenge, particularly in the assessment of indeterminate pulmonary nodules categorized as Lung Imaging Reporting and Data System (Lung-RADS) 3–4A on computed tomography (CT) imaging. These nodules carry a malignancy probability ranging from 1% to 15% (5,6). In practice, such nodules often necessitate additional diagnostic testing, invasive procedures, or frequent follow-ups, which may increase patient burden and potentially delay timely treatment.

With rapid advancements in medical imaging and artificial intelligence technologies, radiomics and deep learning have emerged as promising tools for the evaluation of lung nodules, providing new avenues for improving diagnostic accuracy (7-9). Despite these developments, existing radiomics-based studies face several limitations. Radiomic features are susceptible to various influencing factors and demonstrate considerable variability, which may adversely affect diagnostic precision. Recent studies have highlighted the potential of 2.5D deep learning models to effectively capture spatial information and contextual relationships within imaging data (10,11). Unlike 2D deep learning, which is prone to overfitting on small datasets, and 3D deep learning, which demands extensive computational resources, 2.5D deep learning offers a balance, leveraging spatial context while maintaining computational efficiency (12). 2.5D deep learning is therefore a valuable complement to radiomics for discriminating Lung-RADS category 3 and 4A nodules.

To overcome the limitations of current diagnostic strategies, this study was designed to develop and validate an integrated predictive model that synergistically combines radiomics features with 2.5D deep transfer learning (DTL).

This innovative framework is designed to improve diagnostic accuracy in differentiating between Lung-RADS 3 and 4A pulmonary nodules, ultimately facilitating earlier and more precise lung cancer detection. By harnessing the complementary strengths of advanced computational techniques, the proposed approach aspires to offer clinicians a robust decision-support tool, thereby improving patient outcomes through timely and effective intervention. We present this article in accordance with the TRIPOD reporting checklist (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-584/rc).


Methods

Patients

This retrospective study was approved by the institutional review board, with informed consent waived. Data were collected from three centers: Guangxi Medical University Cancer Hospital (Center 1), the First People’s Hospital of Nanning (Center 2), and Affiliated Hospital of Youjiang Medical University for Nationalities (Center 3). Between March 2021 and October 2023, 247 patients from Center 1 were enrolled and randomly assigned to a training set and test set in a 7:3 ratio. Additionally, 35 patients were recruited from Center 2 (December 2020 to October 2023) and 16 from Center 3 (August 2022 to August 2023); these were combined to form an external validation set. For each patient, only one pulmonary nodule was selected for analysis, resulting in a total of 298 nodules. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the institutional ethics committee of Guangxi Medical University Cancer Hospital (ethical approval No. KY-2022-301) and individual consent for this retrospective analysis was waived. All participating hospitals were informed and agreed to the study.

Inclusion criteria were: (I) histopathologically confirmed diagnosis; (II) nodules classified as Lung-RADS category 3 or 4A by two diagnostic physicians according to the American College of Radiology (ACR) Lung-RADS classification system; and (III) availability of complete imaging data. Exclusion criteria included: (I) biopsy performed prior to CT examination; (II) lung window thin-section thickness exceeding 2 mm; and (III) nodules that exceeded the upper size limits for Lung-RADS category 4A: solid nodules with a maximum diameter >8 mm, part-solid nodules with a solid component >8 mm, or non-solid nodules with an overall diameter >30 mm.

Image processing and segmentation

CT images were acquired from various hospitals and imaging platforms. To standardize imaging data, voxel spacing was resampled to 1 mm × 1 mm × 1 mm. The CT window width and level were adjusted to 1,200 Hounsfield Units (HU) and −350 HU, respectively.
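The resampling and windowing steps above can be sketched as follows. This is a minimal illustration using NumPy and SciPy (the paper does not name its preprocessing toolchain), and `preprocess_ct` is a hypothetical helper operating on a Hounsfield-unit volume:

```python
# Minimal preprocessing sketch: isotropic resampling plus lung-window clipping.
import numpy as np
from scipy import ndimage

def preprocess_ct(volume, spacing, new_spacing=(1.0, 1.0, 1.0),
                  window_width=1200.0, window_level=-350.0):
    """Resample a HU volume to 1 mm isotropic voxels, then clip to the
    lung window (width 1,200 HU, level -350 HU -> range [-950, 250] HU)."""
    zoom = [s / ns for s, ns in zip(spacing, new_spacing)]
    resampled = ndimage.zoom(np.asarray(volume, dtype=np.float32), zoom, order=1)
    lo = window_level - window_width / 2.0
    hi = window_level + window_width / 2.0
    return np.clip(resampled, lo, hi)
```

Clipping to the display window before feature extraction keeps intensities in a comparable range across scanners, which matters when data come from multiple centers.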

Two radiologists, each with five years of experience, manually delineated the region of interest (ROI) for each lesion on a layer-by-layer basis using 3D Slicer software (version 4.8.1), blinded to patients’ clinical and pathological information. Discrepancies were resolved through consensus with a senior radiologist possessing over ten years of experience.

Radiomics feature extraction and selection

Radiomics features were extracted from the segmented ROIs using PyRadiomics (version 3.0) (13). The extracted features encompassed morphological, first-order histogram, and texture features (14).

To ensure reproducibility, the intraclass correlation coefficient (ICC) was used to evaluate inter-observer consistency. Thirty cases were randomly selected, and two experienced physicians independently outlined ROIs and extracted features. Features with ICC ≥0.75 were retained. Redundant features with a Pearson correlation coefficient >0.9 were eliminated to reduce multicollinearity.

Subsequently, the least absolute shrinkage and selection operator (LASSO) regression with 10-fold cross-validation was applied to further select the most informative features and prevent overfitting, ensuring model stability.
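The selection cascade (Pearson redundancy removal followed by LASSO with 10-fold cross-validation) can be sketched with scikit-learn as below. This assumes the ICC ≥0.75 reproducibility filter has already been applied; `select_features` is a hypothetical helper, not code from the paper:

```python
# Sketch of the redundancy filter and LASSO selection step.
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

def select_features(X: pd.DataFrame, y, r_max=0.9):
    """Drop one member of every feature pair with |Pearson r| > r_max,
    then keep the features given non-zero LASSO coefficients
    (regularization strength chosen by 10-fold cross-validation)."""
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > r_max).any()]
    X = X.drop(columns=redundant)
    lasso = LassoCV(cv=10, random_state=0).fit(X.values, np.asarray(y))
    return [f for f, w in zip(X.columns, lasso.coef_) if w != 0]
```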

DTL data preparation

For each patient, the slice containing the largest ROI cross-sectional area was selected as the central image. To reduce computational complexity and minimize background noise, only the smallest bounding rectangle surrounding the ROI was retained.

In addition to the central slice, adjacent slices at intervals of ±1 and ±2 layers along the superior-inferior and anterior-posterior axes were extracted, yielding a total of five two-dimensional images per patient. Features were extracted independently from each slice and subsequently integrated to form 2.5D fused features, preserving spatial contextual information.
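The 2.5D sampling scheme described above can be sketched as follows, assuming the ±1 and ±2 offsets run along the slice (z) axis; `extract_25d_crops` is a hypothetical helper:

```python
# 2.5D sampling sketch: central slice (largest ROI area) plus +/-1 and +/-2
# neighbours, each cropped to the ROI bounding box with a small margin.
import numpy as np

def extract_25d_crops(volume, mask, offsets=(-2, -1, 0, 1, 2), margin=2):
    areas = mask.sum(axis=(1, 2))
    z0 = int(areas.argmax())                       # slice with largest ROI
    ys, xs = np.where(mask[z0] > 0)
    y1 = max(ys.min() - margin, 0)
    y2 = min(ys.max() + margin + 1, mask.shape[1])
    x1 = max(xs.min() - margin, 0)
    x2 = min(xs.max() + margin + 1, mask.shape[2])
    crops = []
    for dz in offsets:                             # clamp at volume borders
        z = int(np.clip(z0 + dz, 0, volume.shape[0] - 1))
        crops.append(volume[z, y1:y2, x1:x2])
    return crops                                   # five 2D images per nodule
```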

DTL model training and feature extraction

Several well-established deep learning architectures (ResNet18/50/101 and DenseNet121) were evaluated through comparative analysis. Cross-entropy loss metrics and Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations were utilized to identify the optimal architecture.

Feature extraction was performed from the penultimate layer of the selected deep learning model, which encapsulates high-level semantic features while minimizing model overfitting. To further reduce dimensionality, principal component analysis (PCA) was applied, enhancing model efficiency and interpretability.

Model construction

Feature selection using LASSO was applied to construct both the radiomics (Rad) model and the DTL model. Support vector machines (SVMs) were employed for classification, given their effectiveness in handling high-dimensional data and modeling non-linear relationships (15). For the integrated deep transfer radiomics (DTR) model, deep learning features and radiomics features were fused using a pre-fusion strategy; the fused feature set then underwent the same feature selection and model construction as the Rad model, with an SVM as the classifier. We optimized the hyperparameters for both LASSO and the SVM. For LASSO, the optimal regularization parameter was identified via 10-fold cross-validation, balancing model complexity with prediction accuracy. For the SVM, we set probability=True, max_iter=111, and a linear kernel (kernel='linear'). These settings aimed to balance training efficiency and prediction performance, ensuring rapid convergence and effective handling of linearly separable data. The overall workflow of radiomics and DTL analysis is illustrated in Figure 1.
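The pre-fusion strategy and the stated SVM settings can be sketched with scikit-learn as below. `build_dtr_model` is a hypothetical helper; in practice the fitted scaler would also need to be persisted and reused at inference time:

```python
# Early-fusion DTR sketch: concatenate radiomics and DTL features, select
# with LASSO, then fit the linear SVM with the hyperparameters stated above.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_dtr_model(X_rad, X_dtl, y):
    X = np.hstack([X_rad, X_dtl])                 # pre-fusion of both sets
    X = StandardScaler().fit_transform(X)
    lasso = LassoCV(cv=10, random_state=0).fit(X, np.asarray(y, dtype=float))
    keep = lasso.coef_ != 0
    if not keep.any():                            # fallback: keep everything
        keep[:] = True
    clf = SVC(kernel="linear", probability=True, max_iter=111)
    clf.fit(X[:, keep], y)
    return clf, keep
```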

Figure 1 Workflow of the study. The process begins with manual segmentation of the ROI corresponding to lung nodules on CT images. Radiomics features are extracted from the segmented ROIs, while deep transfer learning features are derived from each layer of the selected neural network model. These two feature sets are then integrated through feature fusion. To optimize the feature space, LASSO is applied to retain the most discriminative features. Subsequently, interpretability analysis is performed to elucidate the model’s decision-making process. The predictive performance of the model is comprehensively evaluated using receiver operating characteristic curves, calibration curves, and decision curve analysis to assess its robustness, reliability, and clinical utility. AUC, area under the curve; CT, computed tomography; DCA, decision curve analysis; DL, deep learning; Grad-CAM, Gradient-weighted Class Activation Mapping; LASSO, least absolute shrinkage and selection operator; MSE, mean squared error; ROI, region of interest; SHAP, SHapley Additive exPlanations.

Statistical analysis

The normality of clinical variables was assessed using the Shapiro-Wilk test. Continuous variables were analyzed using either the independent samples t-test or the Mann-Whitney U test, depending on the distribution characteristics. Categorical variables were compared using Chi-square tests. A two-sided P value less than 0.05 was considered statistically significant.

To rigorously evaluate the diagnostic performance of our models, receiver operating characteristic (ROC) curves were plotted for both the test and validation cohorts, assessing discriminative capability. Calibration curves were generated to evaluate the agreement between predicted probabilities and actual outcomes, thereby assessing model calibration. Furthermore, decision curve analysis (DCA) was performed to determine the clinical utility and net benefit of each predictive model.

Additional performance metrics—including sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV)—were also calculated to provide a comprehensive evaluation of the model’s diagnostic accuracy.
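These threshold-dependent metrics follow directly from the confusion matrix, as this small scikit-learn sketch shows (`diagnostic_metrics` is a hypothetical helper):

```python
# Metric computation sketch at a fixed operating threshold.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def diagnostic_metrics(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_true, y_prob),
        "sensitivity": tp / (tp + fn),            # recall of the malignant class
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),                    # precision
        "npv": tn / (tn + fn),
    }
```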


Results

Patient characteristics

A total of 298 patients were included in this study, with a mean age of 54.21±11.45 years. The cohort from Center 1 consisted of 247 patients, including 176 cases with malignant nodules and 71 with benign nodules. These patients were randomly divided into a training set (n=172) and a test set (n=75) in a 7:3 ratio. Additionally, Center 2 contributed 35 patients (23 malignant and 12 benign cases), and Center 3 contributed 16 patients (9 malignant and 7 benign cases); these were combined to form the independent validation set.

Statistical analysis revealed significant differences in age, history of other lung diseases, and nodule density among the patient populations. However, no significant differences were observed in gender, smoking history, nodule position, or history of malignancy. Detailed baseline characteristics of all cohorts are summarized in Table 1.

Table 1

Clinical characteristics of enrolled patients

Characteristics All (n=298) Validation (n=51) Training (n=172) Test (n=75) P value
Age (years) 54.21±11.45 57.78±11.87 53.33±10.90 53.80±12.06 0.047
Nodule maximum diameter (mm) 11.81±4.34 11.84±3.74 11.98±4.54 11.40±4.26 0.63
Gender 0.86
   Female 168 (56.38) 27 (52.94) 98 (56.98) 43 (57.33)
   Male 130 (43.62) 24 (47.06) 74 (43.02) 32 (42.67)
Smoking 0.61
   No 246 (82.55) 40 (78.43) 142 (82.56) 64 (85.33)
   Yes 52 (17.45) 11 (21.57) 30 (17.44) 11 (14.67)
Nodule position 0.13
   Upper right 102 (34.23) 22 (43.14) 55 (31.98) 25 (33.33)
   Middle right 26 (8.72) 4 (7.84) 16 (9.30) 6 (8.00)
   Lower right 54 (18.12) 10 (19.61) 23 (13.37) 21 (28.00)
   Upper left 75 (25.17) 9 (17.65) 50 (29.07) 16 (21.33)
   Lower left 41 (13.76) 6 (11.76) 28 (16.28) 7 (9.33)
Other lung diseases 0.02
   No 223 (74.83) 46 (90.20) 123 (71.51) 54 (72.00)
   Yes 75 (25.17) 5 (9.80) 49 (28.49) 21 (28.00)
History of malignancy 0.49
   No 274 (91.95) 49 (96.08) 157 (91.28) 68 (90.67)
   Yes 24 (8.05) 2 (3.92) 15 (8.72) 7 (9.33)
Density <0.001
   Solid 111 (37.25) 33 (64.71) 59 (34.30) 19 (25.33)
   Subsolid 116 (38.93) 10 (19.61) 69 (40.12) 37 (49.33)
   Ground-glass opacity 71 (23.83) 8 (15.69) 44 (25.58) 19 (25.33)

Data are presented as mean ± standard deviation for continuous variables and n (%) for categorical variables.

Deep model selection

In DTL, the choice of loss function is critical for quantifying the discrepancy between predicted and actual values, thereby guiding model parameter optimization (16). We pre-trained four deep learning models using the largest ROI cross-sectional images and compared their respective loss values. Among the evaluated models, ResNet18 demonstrated the lowest loss value, indicating reduced learning errors and faster convergence during training. Moreover, Grad-CAM visualizations confirmed that ResNet18 exhibited superior focus on nodule-centered regions within the largest ROI cross-sections (Figure 2). Based on these findings, ResNet18 was selected for subsequent analyses.

Figure 2 Comparison of loss values and Grad-CAM visualizations across four deep transfer learning models: ResNet18, ResNet50, ResNet101, and DenseNet121. (A) Loss value curves for each model, with smaller loss values indicating better model performance and faster convergence. (B) Grad-CAM visualization showing the model’s focus on the nodule-centered regions. Grad-CAM, Gradient-weighted Class Activation Mapping.

Feature selection

A total of 1,834 radiomics features were initially extracted from the segmented imaging data. Employing the LASSO method, we conducted rigorous feature selection to identify the most relevant predictors. This process reduced the feature set to 19 radiomics features with the strongest associations to the outcome of interest.

For DTL features, PCA-based dimensionality reduction was applied to the features extracted from each of the five slices, retaining 32 features per slice. Aggregating the features across all five slices yielded a total of 160 DTL features. Subsequent refinement using LASSO further reduced this set to 21 features, ensuring that only the most informative features were preserved.

By integrating the radiomics and DTL features, we generated a combined feature set consisting of 1,994 features, referred to as DTR features. To optimize this dataset for model development, LASSO was applied once more, ultimately retaining 24 features with the highest predictive value.

Feature importance analysis using SHapley Additive exPlanations (SHAP)

Based on the selected features, three predictive models were developed: the Rad model, the DTL model, and the DTR model, which integrates both radiomics and DTL features. To elucidate each feature’s contribution to the model’s predictions, SHAP analysis was performed on the DTR model (Figure 3).

Figure 3 SHAP value distribution and feature importance analysis illustrating the contribution of individual features to the model's predictions. (A) SHAP summary plot providing an overview of feature contributions across the entire dataset, with color gradients representing feature values (red indicates higher feature values, blue indicates lower feature values). (B) Feature importance plot displaying the average influence of each feature on the model’s output, with features ranked by their SHAP values. (C) SHAP plots for two representative patients, showing the distribution and impact of specific features on individual predictions, with arrows indicating key features that significantly influenced the model's predictions. SHAP, SHapley Additive exPlanations.

Figure 3A illustrates the distribution of SHAP values for each feature, with color gradients indicating feature values from low (blue) to high (red). These SHAP values provide a clear visualization of each feature’s influence on the model’s predictions. The analysis revealed that the DTR model relies on a combination of radiomics and DTL features for decision-making.

Figure 3B displays the feature importance rankings, where features such as DTL_9.4 and log_sigma_2_0_mm_3D_firstorder_Skewness demonstrated the highest contributions, underscoring the synergistic value of integrating both modalities. This comprehensive understanding of feature importance provides a foundation for further model refinement and clinical interpretability. Additionally, individual patient-level SHAP analyses were conducted. For example, in patient 1, the model output value of 0.36 was lower than the base value of 0.709, suggesting a benign diagnosis, with feature arrows visually illustrating the quantitative contributions of key predictors (Figure 3C). Conversely, patient 2 exhibited an output value of 0.947, significantly higher than the base value, corresponding to a malignant prediction.
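For a linear model such as the linear-kernel SVM used here, SHAP values have a simple closed form when features are treated as independent: phi_i = w_i (x_i − E[x_i]), with the base value equal to the model output at the mean feature vector. The study used the SHAP library itself; this sketch (with the hypothetical `linear_shap`) only illustrates the additivity property behind plots like Figure 3C:

```python
# Closed-form SHAP values for a linear model with independent features.
import numpy as np

def linear_shap(coef, intercept, X_background, x):
    mu = X_background.mean(axis=0)                # expected feature values
    phi = coef * (x - mu)                         # per-feature contributions
    base = float(coef @ mu + intercept)           # model output at E[x]
    return phi, base                              # sum(phi) + base == f(x)
```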

Model performance and comparative analysis

The diagnostic performance of the three models—Rad, DTL, and DTR—was systematically evaluated using the area under the curve (AUC) across the training, testing, and external validation cohorts. In the training cohort, the DTR model achieved the highest discriminative capability with an AUC of 0.975 [95% confidence interval (CI): 0.9529–0.9981], significantly outperforming the Rad model (AUC: 0.743, 95% CI: 0.6597–0.8262) and the DTL model (AUC: 0.843, 95% CI: 0.7773–0.9081). This substantial improvement underscores the benefit of integrating radiomics with DTL features, likely due to the combination of macroscopic texture descriptors and hierarchical spatial dependencies. In the test cohort, the DTR model maintained robust performance with an AUC of 0.851 (95% CI: 0.7328–0.9683), exceeding the DTL model (AUC: 0.757, 95% CI: 0.6223–0.8909) and Rad model (AUC: 0.642, 95% CI: 0.4832–0.8007) by margins of 0.094 and 0.209, respectively. Notably, the DTR model demonstrated high specificity (0.895) and PPV (0.956), indicating strong potential for reducing unnecessary interventions in benign cases. In the external validation cohort, the DTR model achieved an AUC of 0.727 (95% CI: 0.5876–0.8663), outperforming both the Rad model (AUC: 0.613, 95% CI: 0.4522–0.7747) and the DTL model (AUC: 0.701, 95% CI: 0.5374–0.8639). However, a moderate decline in performance compared to the test cohort was observed, potentially reflecting domain shifts in nodule characteristics across different institutions.

Overall, the DTR model consistently demonstrated superior diagnostic performance across all cohorts. The relatively suboptimal results of the Rad model align with prior criticisms of traditional radiomics, which highlight challenges related to feature redundancy and limited generalizability. While the DTL model exhibited intermediate performance, its standalone application may fail to capture critical textural information effectively addressed by radiomics. These findings are consistent with recent studies advocating for hybrid frameworks, reinforcing that the fusion of multimodal features enhances diagnostic robustness, particularly in small and heterogeneous datasets. The comparative effectiveness of the three models is detailed in Table 2 and illustrated in Figure 4.

Table 2

Model performance comparison across different datasets

Cohort Signature Accuracy AUC 95% CI Sensitivity Specificity PPV NPV Precision Recall F1 Threshold
Train Rad 0.779 0.743 0.6597–0.8262 0.892 0.519 0.811 0.675 0.811 0.892 0.849 0.622
DTL 0.791 0.843 0.7773–0.9081 0.792 0.788 0.896 0.621 0.896 0.792 0.841 0.691
DTR 0.942 0.975 0.9529–0.9981 0.950 0.923 0.966 0.889 0.966 0.950 0.958 0.565
Test Rad 0.800 0.642 0.4832–0.8007 0.946 0.368 0.815 0.700 0.815 0.946 0.876 0.492
DTL 0.733 0.757 0.6223–0.8909 0.750 0.684 0.875 0.481 0.875 0.750 0.808 0.683
DTR 0.800 0.851 0.7328–0.9683 0.768 0.895 0.956 0.567 0.956 0.768 0.851 0.778
Validation Rad 0.667 0.613 0.4522–0.7747 0.844 0.368 0.692 0.583 0.692 0.844 0.761 0.699
DTL 0.725 0.701 0.5374–0.8639 0.812 0.579 0.765 0.647 0.765 0.812 0.788 0.741
DTR 0.667 0.727 0.5876–0.8663 0.594 0.789 0.826 0.536 0.826 0.594 0.691 0.853

AUC, area under the curve; CI, confidence interval; DTL, deep transfer learning; DTR, deep transfer radiomics; NPV, negative predictive value; PPV, positive predictive value; Rad, radiomics.

Figure 4 ROC curves evaluating the diagnostic performance of the models in predicting malignancy in Lung-RADS category 3 and 4A pulmonary nodules across different datasets. (A) ROC curve for the training cohort, reflecting the model’s learning capability. (B) ROC curve for the test cohort, assessing predictive performance on unseen data. (C) ROC curve for the validation cohort, providing an independent evaluation of model generalizability. AUC, area under the curve; CI, confidence interval; DTL, deep transfer learning; DTR, deep transfer radiomics; Lung-RADS, Lung Imaging Reporting and Data System; Rad, radiomics; ROC, receiver operating characteristic.

Comparison of differences in model efficacy

In this study, we performed DeLong’s test across the training, testing, and validation sets to evaluate the AUC values of the three models and determine if there were significant differences in their ROC curves (Figure 5). The results, visually represented by color intensity where red indicates smaller P values and thus more significant differences, revealed the following:

Figure 5 DeLong’s test results of the three models across different datasets. (A) DeLong’s test results in the training set; (B) DeLong’s test results in the testing set; (C) DeLong’s test results in the external validation set. DTL, deep transfer learning; DTR, deep transfer radiomics; Rad, radiomics.

In the training set, the DTR model demonstrated a significant advantage over the DTL and Rad models (P<0.001). In the testing set, DTR still outperformed Rad (P=0.03), but its superiority was less pronounced than in the training set, and it no longer showed a significant difference compared to DTL (P=0.14). This suggests that the generalization ability of the DTR model may be constrained by data distribution or model complexity. In the validation set, no significant differences were found among the DTR, Rad, and DTL models, which may reflect the limited size of the external cohort and differences in sample distribution across centers.
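DeLong's test is not shipped with SciPy or scikit-learn; a compact implementation of the paired version via placement values is sketched below for illustration (a reference implementation should be consulted before production use):

```python
# Paired DeLong test for the difference between two correlated AUCs.
import numpy as np
from scipy import stats

def delong_test(y_true, prob_a, prob_b):
    """Return (auc_a, auc_b, two-sided p value) for two models scored on
    the same cases, following the placement-value formulation."""
    y = np.asarray(y_true)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)

    def placements(p):
        p = np.asarray(p, dtype=float)
        diff = p[pos][:, None] - p[neg][None, :]
        psi = (diff > 0).astype(float) + 0.5 * (diff == 0)
        return psi.mean(axis=1), psi.mean(axis=0)  # V10 (per pos), V01 (per neg)

    v10a, v01a = placements(prob_a)
    v10b, v01b = placements(prob_b)
    auc_a, auc_b = v10a.mean(), v10b.mean()
    s10, s01 = np.cov(v10a, v10b), np.cov(v01a, v01b)
    var = ((s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / len(pos)
           + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / len(neg))
    z = (auc_a - auc_b) / np.sqrt(var)
    return auc_a, auc_b, 2 * stats.norm.sf(abs(z))
```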

Clinical utility and model calibration

DCA is a valuable method for evaluating the clinical applicability of predictive models by quantifying the net benefit across varying threshold probabilities (17). In this study, the DTR model demonstrated the highest net benefit over a wide range of threshold probabilities, outperforming both the Rad and DTL models. These findings suggest that the DTR model holds significant potential to improve clinical decision-making by providing more accurate predictions regarding the malignancy risk of Lung-RADS category 3 and 4A pulmonary nodules. Additionally, calibration curves were employed to assess the agreement between the predicted probabilities and the actual outcomes (18). The calibration curves for the DTR model exhibited strong concordance, indicating that the model’s predictions are well-calibrated and reliable for clinical use. Figure 6 illustrates the DCA and calibration curves for all datasets, further confirming the superior clinical utility and dependable calibration of the DTR model.
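The net-benefit quantity underlying the decision curves has a simple form: at threshold probability pt, NB = TP/N − FP/N × pt/(1 − pt), with the "treat-all" reference line equal to prevalence − (1 − prevalence) × pt/(1 − pt). A minimal sketch (`net_benefit` is a hypothetical helper):

```python
# Decision-curve net benefit across a range of threshold probabilities.
import numpy as np

def net_benefit(y_true, y_prob, thresholds):
    y = np.asarray(y_true)
    p = np.asarray(y_prob)
    n = len(y)
    out = []
    for pt in thresholds:
        pred = p >= pt
        tp = np.sum(pred & (y == 1))
        fp = np.sum(pred & (y == 0))
        out.append(tp / n - fp / n * pt / (1 - pt))
    return np.array(out)
```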

Figure 6 Calibration curves and DCA assessing model calibration and clinical utility. DCA curves for the (A) training, (B) test, and (C) validation cohorts, demonstrating the net benefit of each model across a range of threshold probabilities. (D-F) Calibration curves for (D) training, (E) test, and (F) validation datasets, evaluating the agreement between predicted probabilities and actual outcomes. DCA, decision curve analysis; DTL, deep transfer learning; DTR, deep transfer radiomics; Rad, radiomics.

Discussion

This study presents a comprehensive predictive model that integrates radiomics with 2.5D DTL to enhance the accuracy of malignancy prediction for Lung-RADS category 3 and 4A pulmonary nodules. Our study demonstrated that the DTR model achieved an AUC of 0.975 and an accuracy of 0.942 in the training cohort, indicating that the fusion of radiomics and DTL features significantly improves the model’s discriminative ability—an essential aspect of diagnostic precision. This performance notably surpasses that of conventional Rad models, validating the effectiveness of DTR features in accurately predicting the malignant potential of lung nodules.

Radiomics, as a data-driven analytical methodology, has emerged as a powerful computational tool for disease diagnosis by extracting high-throughput quantitative features from medical imaging data (19-21). However, in our study, the standalone Rad model exhibited suboptimal discriminative capability, as reflected by consistently lower AUC values across all cohorts. This result suggests that radiomics alone may not adequately capture the complex spatial-temporal patterns associated with malignancy. DTL, in contrast, offers a paradigm shift by utilizing pre-trained convolutional neural networks for automated extraction of high-dimensional imaging features. This approach overcomes the limitations of radiomics, which depends heavily on hand-crafted feature engineering, thereby offering distinct advantages in identifying intricate imaging biomarkers (22).

To further optimize performance, we introduced a novel 2.5D feature fusion strategy. Specifically, the 2.5D model captures the continuous changes of lesions in three-dimensional space by fusing features from adjacent slices (central slice ±1, ±2 slices). This approach outperforms traditional radiomics methods, which typically rely on single-slice or full 3D volume analysis. In our study, the AUC value improved by 31.2% (from 0.743 to 0.975) compared to the pure 2D Rad model, underscoring the importance of spatial contextual information in disease diagnosis. Additionally, the 2.5D strategy reduces computational complexity by focusing on a limited number of slices, making the model more feasible for clinical application by providing timely predictions that align with clinical workflows. Moreover, the use of multi-slice sampling (central slice ±2 slices) generates more diverse training samples, alleviating the challenge of limited medical imaging data. This approach enables the model to capture both local texture features and global relational features, addressing the limitations of traditional radiomics in cross-slice feature extraction. Our results are consistent with prior findings by Zhu et al. (23), who demonstrated the superior performance of 2.5D deep learning models in classifying lung cancer brain metastases compared to 2D models.

Our 2.5D DTR model exhibited consistently strong performance in differentiating benign from malignant nodules categorized as Lung-RADS 3 and 4A. It achieved high accuracy, sensitivity, specificity, PPV, and NPV in the training dataset and maintained stable performance, with only minor reductions, in the validation cohort. The decrease in AUC from the testing set (0.851) to external validation (0.727) highlights the challenge of model generalization across institutions. Such variation may arise from differences in scanning protocols and patient populations, including scanner models, parameter settings, and imaging angles. These technical factors can alter image characteristics such as pixel resolution, gray level, and contrast, thereby affecting the model’s ability to extract and recognize image features. To mitigate this, future work could establish unified scanning standards, diversify the training data, apply data augmentation techniques, and conduct multi-center validation studies. The superior results can be attributed to the synergistic integration of radiomics and DTL features. This aligns with findings from Hu et al. (24), who reported improved diagnostic accuracy in distinguishing benign and malignant ground-glass nodules using a combined deep learning and radiomics approach. Similarly, Zhang et al. (25) highlighted the effectiveness of 2.5D frameworks, achieving an AUC of 0.795 in predicting hepatocellular carcinoma recurrence, further supporting the broad applicability of this methodology in oncology.
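The fusion underlying the DTR model can be sketched as early feature-level fusion: standardize each feature family, then concatenate. The feature dimensions and variable names below are illustrative assumptions, not the study's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical per-nodule feature vectors (dimensions are illustrative)
rad_features = rng.random(30)    # hand-crafted radiomics features
dtl_features = rng.random(128)   # 2.5D deep-transfer-learning features

def zscore(x):
    """Standardize a feature block so neither family dominates by scale."""
    return (x - x.mean()) / x.std()

# early fusion: concatenate the standardized blocks into one DTR vector
dtr_features = np.concatenate([zscore(rad_features), zscore(dtl_features)])
print(dtr_features.shape)  # (158,)
```

A downstream classifier trained on such fused vectors can then exploit both hand-crafted and learned representations.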

This study also conducted DCA and calibration curve analyses. To facilitate practical implementation of the DTR model in clinical workflows, we propose the following: first, integrate the DTR model into radiology reporting systems to provide clinicians with real-time malignancy risk predictions, supporting decisions about further diagnostic procedures such as biopsy or follow-up imaging. Second, use the model’s predictions for patient risk stratification to create more personalized management plans, for example, prioritizing invasive procedures for high-risk patients while adopting conservative approaches for low-risk ones. Moreover, the DTR model has significant cost-effectiveness potential: by accurately identifying high-risk patients, it can reduce unnecessary surgery in low-risk patients and thereby lower healthcare costs.
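DCA quantifies the net benefit of acting on the model's predictions across risk thresholds. A minimal sketch of the standard net-benefit calculation, using toy labels and probabilities rather than study data:

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit at risk threshold pt:
    (TP - FP * pt / (1 - pt)) / N."""
    y_true = np.asarray(y_true)
    treat = np.asarray(y_prob) >= threshold  # nodules flagged for action
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return (tp - fp * threshold / (1 - threshold)) / len(y_true)

# toy example: two malignant, two benign nodules, well separated
y_true = [1, 1, 0, 0]
y_prob = [0.90, 0.80, 0.20, 0.10]
print(net_benefit(y_true, y_prob, 0.5))  # 0.5
```

Plotting net benefit against threshold, alongside the "treat-all" and "treat-none" strategies, yields the decision curve.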

While deep learning has demonstrated powerful feature extraction capabilities and predictive performance in medical imaging (26,27), its “black box” nature raises concerns regarding interpretability and clinical trust (28,29). This is particularly critical in healthcare, where understanding a model’s rationale is essential for ensuring safety and reliability. To address this, we applied SHAP analysis to elucidate feature contributions. SHAP revealed the key features driving model predictions, such as DTL_9.4 and log_sigma_2_0_mm_3D_firstorder_Skewness, enhancing model interpretability and transparency (30). Integrating SHAP not only aids in optimizing the model but also provides insights that support clinical decision-making. A deeper interpretation of how these specific features influence predictions is, however, essential for cultivating clinical trust. For instance, log_sigma_2_0_mm_3D_firstorder_Skewness likely reflects the asymmetry of the voxel-intensity distribution within pulmonary nodules, a marker of intratumoral heterogeneity that may correlate with malignancy; malignant nodules are often characterized by more irregular intensity distributions. Such features presumably capture morphological or textural information critical for distinguishing benign from malignant cases. By understanding the clinical significance of these features, clinicians can better interpret the model’s predictions and incorporate them into their decision-making, gaining vital insight into the nodule characteristics that drive malignancy prediction.
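To make the skewness feature concrete: first-order skewness is the third standardized moment of the voxel-intensity distribution (the log_sigma prefix denotes a Laplacian-of-Gaussian pre-filter, omitted in this sketch). A minimal illustration on toy intensity values:

```python
import numpy as np

def firstorder_skewness(voxels):
    """First-order skewness of a voxel-intensity distribution:
    the third standardized moment m3 / m2**1.5 (population moments)."""
    x = np.asarray(voxels, dtype=np.float64)
    mu = x.mean()
    m2 = np.mean((x - mu) ** 2)  # variance
    m3 = np.mean((x - mu) ** 3)  # third central moment
    return m3 / (m2 ** 1.5)

# a symmetric intensity distribution has zero skewness
print(firstorder_skewness([1, 2, 3, 4, 5]))  # 0.0
# a long right tail (a few bright voxels) gives positive skewness
print(firstorder_skewness([1, 1, 1, 10]) > 0)  # True
```

Strongly asymmetric intensity histograms of this kind are one way intratumoral heterogeneity shows up as a single quantitative feature.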

In this study, we used a retrospective design, analyzing historical case data to evaluate model performance. While this approach makes efficient use of existing data and reduces cost and time, it is susceptible to selection bias. To limit this bias, we collaborated with multiple healthcare institutions of varying levels during sample collection to broaden the case sources and enhance sample diversity. In future studies, we will further expand the sample sources, apply advanced statistical methods to adjust for potential bias, and incorporate prospective study designs. These efforts aim to improve the generalizability and clinical applicability of the findings, thereby strengthening the quality and credibility of the study.

Despite promising results, several limitations should be acknowledged. First, the relatively small sample size may restrict the generalizability of the model. Second, our study focused on Lung-RADS category 3 and 4A nodules; the model’s performance on other types of lung nodules remains unvalidated. Future research should increase the sample size by recruiting patients from more regions, ethnic groups, and disease stages, which would improve the sample’s diversity and representativeness, enhance the model’s robustness in complex clinical scenarios, and strengthen the generalizability of the findings.


Conclusions

Our findings underscore the significant advantages of integrating radiomics with 2.5D DTL for predicting the malignancy of pulmonary nodules. The DTR model demonstrated superior diagnostic performance compared to both conventional Rad and DTL models. This improvement is primarily attributed to the synergistic fusion of deep learning and radiomics features, which expands the feature space and enhances the extraction of nuanced imaging characteristics critical for accurate diagnosis. By combining these complementary modalities, the DTR model captures a more comprehensive range of information, refining the diagnostic process. This methodological advancement holds substantial promise for clinical application, offering more precise and timely diagnostic insights that can improve treatment planning and patient outcomes.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-584/rc

Data Sharing Statement: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-584/dss

Peer Review File: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-584/prf

Funding: This study was supported by Guangxi Provincial Key Research and Development Program (No. GuikeAB23026087).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-584/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the institutional ethics committee of Guangxi Medical University Cancer Hospital (ethical approval No. KY-2022-301) and individual consent for this retrospective analysis was waived. All participating hospitals were informed of and agreed to the study.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Leiter A, Veluswamy RR, Wisnivesky JP. The global burden of lung cancer: current status and future trends. Nat Rev Clin Oncol 2023;20:624-39. [Crossref] [PubMed]
  2. Sung H, Ferlay J, Siegel RL, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 2021;71:209-49. [Crossref] [PubMed]
  3. Chinese expert consensus on diagnosis of early lung cancer (2023 Edition). Zhonghua Jie He He Hu Xi Za Zhi 2023;46:1-18. [Crossref] [PubMed]
  4. Stefani D, Plönes T, Viehof J, et al. Lung Cancer Surgery after Neoadjuvant Immunotherapy. Cancers (Basel) 2021;13:4033. [Crossref] [PubMed]
  5. Christensen J, Prosper AE, Wu CC, et al. ACR Lung-RADS v2022: Assessment Categories and Management Recommendations. Chest 2024;165:738-53. [Crossref] [PubMed]
  6. Mendoza DP, Petranovic M, Som A, et al. Lung-RADS Category 3 and 4 Nodules on Lung Cancer Screening in Clinical Practice. AJR Am J Roentgenol 2022;219:55-65. [Crossref] [PubMed]
  7. Sun Y, Ge X, Niu R, et al. PET/CT radiomics and deep learning in the diagnosis of benign and malignant pulmonary nodules: progress and challenges. Front Oncol 2024;14:1491762. [Crossref] [PubMed]
  8. de Margerie-Mellon C, Chassagnon G. Artificial intelligence: A critical review of applications for lung nodule and lung cancer. Diagn Interv Imaging 2023;104:11-7. [Crossref] [PubMed]
  9. Shi L, Sheng M, Wei Z, et al. CT-Based Radiomics Predicts the Malignancy of Pulmonary Nodules: A Systematic Review and Meta-Analysis. Acad Radiol 2023;30:3064-75. [Crossref] [PubMed]
  10. Song C, Zhao CY, Li K, et al. A 2.5D transfer deep learning model based on artificial intelligence for differentiating lymphoma and tuberculous lymphadenitis in HIV/AIDS patients. J Infect 2025;90:106439. [Crossref] [PubMed]
  11. Kumar A, Jiang H, Imran M, et al. A flexible 2.5D medical image segmentation approach with in-slice and cross-slice attention. Comput Biol Med 2024;182:109173. [Crossref] [PubMed]
  12. Avesta A, Hossain S, Lin M, et al. Comparing 3D, 2.5D, and 2D Approaches to Brain Image Auto-Segmentation. Bioengineering (Basel) 2023;10:181. [Crossref] [PubMed]
  13. Yu Y, Li GF, Tan WX, et al. Towards automatical tumor segmentation in radiomics: a comparative analysis of various methods and radiologists for both region extraction and downstream diagnosis. BMC Med Imaging 2025;25:63. [Crossref] [PubMed]
  14. Li H, Liang S, Cui M, et al. A preoperative pathological staging prediction model for esophageal cancer based on CT radiomics. BMC Cancer 2025;25:298. [Crossref] [PubMed]
  15. Mukherjee S, Patra A, Khasawneh H, et al. Radiomics-based Machine-learning Models Can Detect Pancreatic Cancer on Prediagnostic Computed Tomography Scans at a Substantial Lead Time Before Clinical Diagnosis. Gastroenterology 2022;163:1435-1446.e3. [Crossref] [PubMed]
  16. Yu Q, Ning Y, Wang A, et al. Deep learning-assisted diagnosis of benign and malignant parotid tumors based on contrast-enhanced CT: a multicenter study. Eur Radiol 2023;33:6054-65. [Crossref] [PubMed]
  17. Zhao L, Leng Y, Hu Y, et al. Understanding decision curve analysis in clinical prediction model research. Postgrad Med J 2024;100:512-5. [Crossref] [PubMed]
  18. Xing Z, Cai L, Wu Y, et al. Development and validation of a nomogram for predicting in-hospital mortality of patients with cervical spine fractures without spinal cord injury. Eur J Med Res 2024;29:80. [Crossref] [PubMed]
  19. Pyrros A, Chen A, Rodríguez-Fernández JM, et al. Deep Learning-Based Digitally Reconstructed Tomography of the Chest in the Evaluation of Solitary Pulmonary Nodules: A Feasibility Study. Acad Radiol 2023;30:739-48. [Crossref] [PubMed]
  20. Venkadesh KV, Aleef TA, Scholten ET, et al. Prior CT Improves Deep Learning for Malignancy Risk Estimation of Screening-detected Pulmonary Nodules. Radiology 2023;308:e223308. [Crossref] [PubMed]
  21. Venkadesh KV, Setio AAA, Schreuder A, et al. Deep Learning for Malignancy Risk Estimation of Pulmonary Nodules Detected at Low-Dose Screening CT. Radiology 2021;300:438-47. [Crossref] [PubMed]
  22. Iman M, Arabnia HR, Rasheed K. A Review of Deep Transfer Learning and Recent Advancements. Technologies 2023;11:40.
  23. Zhu J, Zou L, Xie X, et al. 2.5D deep learning based on multi-parameter MRI to differentiate primary lung cancer pathological subtypes in patients with brain metastases. Eur J Radiol 2024;180:111712. [Crossref] [PubMed]
  24. Hu X, Gong J, Zhou W, et al. Computer-aided diagnosis of ground glass pulmonary nodule by fusing deep learning and radiomics features. Phys Med Biol 2021;66:065015. [Crossref] [PubMed]
  25. Zhang YB, Chen ZQ, Bu Y, et al. Construction of a 2.5D Deep Learning Model for Predicting Early Postoperative Recurrence of Hepatocellular Carcinoma Using Multi-View and Multi-Phase CT Images. J Hepatocell Carcinoma 2024;11:2223-39. [Crossref] [PubMed]
  26. Zhong Y, She Y, Deng J, et al. Deep Learning for Prediction of N2 Metastasis and Survival for Clinical Stage I Non-Small Cell Lung Cancer. Radiology 2022;302:200-11. [Crossref] [PubMed]
  27. Pan Z, Hu G, Zhu Z, et al. Predicting Invasiveness of Lung Adenocarcinoma at Chest CT with Deep Learning Ternary Classification Models. Radiology 2024;311:e232057. [Crossref] [PubMed]
  28. Şahin E, Arslan NN, Özdemir D. Unlocking the black box: an in-depth review on interpretability, explainability, and reliability in deep learning. Neural Comput Applic 2024;37:859-965.
  29. Felder RM. Coming to Terms with the Black Box Problem: How to Justify AI Systems in Health Care. Hastings Cent Rep 2021;51:38-45. [Crossref] [PubMed]
  30. Hsu WH, Ko AT, Weng CS, et al. Explainable machine learning model for predicting skeletal muscle loss during surgery and adjuvant chemotherapy in ovarian cancer. J Cachexia Sarcopenia Muscle 2023;14:2044-53. [Crossref] [PubMed]
Cite this article as: Wang K, Huang X, Huang X, Mu X, Liu L, Jin G. Fusion of 2.5D deep transfer learning and radiomics for predicting benign and malignant Lung Imaging Reporting and Data System (Lung-RADS) 3 and 4A nodules. J Thorac Dis 2025;17(10):8360-8373. doi: 10.21037/jtd-2025-584