Performance evaluation of deep learning-based osteoporosis diagnostic models with conventional chest X-ray in a clinical cohort

Bona Koo; Yelin Roh; Gyubin Shin; Yujin Yang; Rena Lee; Sungho Cho; So Hyun Ahn; Kwan-Chang Kim

doi:10.21037/jtd-2025-1077

Original Article

Bona Koo¹, Yelin Roh¹, Gyubin Shin¹, Yujin Yang¹, Rena Lee²,³, Sungho Cho³, So Hyun Ahn⁴,⁵, Kwan-Chang Kim⁶

¹School of Medicine, Ewha Womans University, Seoul, South Korea; ²Department of Biomedical Engineering, College of Medicine, Ewha Womans University, Seoul, South Korea; ³REMEDI Research and Development Center, Seoul, South Korea; ⁴Ewha Medical Research Institute, School of Medicine, Ewha Womans University, Seoul, South Korea; ⁵Ewha Medical Artificial Intelligence Research Institute, Ewha Womans University College of Medicine, Seoul, South Korea; ⁶Department of Thoracic and Cardiovascular Surgery, School of Medicine, Ewha Womans University, Seoul, South Korea

Contributions: (I) Conception and design: B Koo, Y Roh; (II) Administrative support: SH Ahn, KC Kim; (III) Provision of study materials or patients: None; (IV) Collection and assembly of data: All authors; (V) Data analysis and interpretation: G Shin, Y Yang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: So Hyun Ahn, PhD. Ewha Medical Research Institute, School of Medicine, Ewha Womans University, 25 Magokdong-ro 2-gil, Gangseo-gu, Seoul 07804, South Korea; Ewha Medical Artificial Intelligence Research Institute, Ewha Womans University College of Medicine, Seoul, South Korea. Email: mpsohyun@ewha.ac.kr; Kwan-Chang Kim, MD, PhD. Department of Thoracic and Cardiovascular Surgery, School of Medicine, Ewha Womans University, 25 Magokdong-ro 2-gil, Gangseo-gu, Seoul 07804, South Korea. Email: mdkkchang@ewha.ac.kr.

Background: Dual-energy X-ray absorptiometry (DXA) is the gold standard for diagnosing osteoporosis; however, its limited accessibility often hinders routine screening in primary care settings. To address this gap, we developed and evaluated a deep learning-based model, PROS® CXR: OSTEO (Promedius, Inc., Seoul, South Korea), which predicts osteoporosis from conventional chest radiographs.

Methods: This retrospective study included 80 adult patients who underwent both DXA and chest radiography within a 3-month interval. The deep learning model, based on convolutional neural networks and trained via transfer learning, generated osteoporosis predictions from chest X-rays. Model performance was assessed against DXA-derived T-scores of the femur and lumbar spine, using either the minimum or average T-score per site as the reference standard.

Results: The proposed model achieved an area under the curve (AUC) of 0.94 for femur and 0.93 for lumbar spine predictions. For osteoporosis screening, the sensitivity and specificity were 90% and 81%, respectively. Subgroup analysis demonstrated higher predictive performance in female patients, whereas false-positives (FPs) occurred more frequently in males.

Conclusions: The PROS® CXR: OSTEO model enables opportunistic and low-cost osteoporosis screening using routine chest radiographs. This approach holds promise for early detection in aging populations and resource-limited settings. Further optimization is required to improve specificity and minimize FPs before clinical implementation.

Keywords: Chest X-ray; osteoporosis; early diagnosis; deep learning; artificial intelligence (AI)

Submitted Jun 14, 2025. Accepted for publication Sep 03, 2025. Published online Nov 26, 2025.

Highlight box

Key findings

• The PROS® CXR: OSTEO program, a convolutional neural network (CNN)-based model trained on chest X-rays, can predict osteoporosis with high sensitivity, offering an accessible alternative to dual-energy X-ray absorptiometry (DXA).

What is known and what is new?

• While DXA is the gold standard for osteoporosis diagnosis, its accessibility is limited.

• This study presents a deep learning model that can screen for osteoporosis from routine chest radiographs.

What is the implication, and what should change now?

• Opportunistic osteoporosis screening via chest X-rays could be implemented in routine care, especially in aging populations.

• A deep learning-based osteoporosis screening tool helps identify at-risk patients early, although the specificity in male patients needs improvement.

Introduction

Globally, approximately one-third of women over the age of 50 years and one-fifth of men in the same age group suffer from osteoporosis (1-4). With the increasing elderly population and rising life expectancy, early diagnosis is becoming increasingly important (5). The standard method for diagnosing osteoporosis, dual-energy X-ray absorptiometry (DXA), measures bone mineral density (BMD) at the spine or femur. The measured BMD is compared with the average BMD of a healthy 30-year-old adult at the same site, yielding a T-score. Based on the T-score, bone health is classified as normal (T-score ≥−1.0), osteopenia (T-score between −1.0 and −2.5), or osteoporosis (T-score ≤−2.5). Interpretation of the T-score requires caution in very young or elderly individuals, as it is based on the average BMD of individuals in their thirties (6); in such cases, the Z-score, which reflects age-, gender-, and body size-matched comparisons, may be more appropriate, particularly for children, young adults, or those with suspected secondary osteoporosis.
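The T-score and Z-score described above share one formula and differ only in the reference population. The following minimal sketch illustrates this; the reference means and standard deviations are hypothetical values chosen for the example and are not taken from this study or from any DXA manufacturer's database.

```python
def standardized_score(bmd: float, reference_mean: float, reference_sd: float) -> float:
    """Distance of a BMD value from a reference mean, in reference standard deviations."""
    if reference_sd <= 0:
        raise ValueError("reference_sd must be positive")
    return (bmd - reference_mean) / reference_sd


# Hypothetical reference values in g/cm^2 (illustrative only).
YOUNG_ADULT_MEAN, YOUNG_ADULT_SD = 1.00, 0.12  # young-adult reference gives the T-score
AGE_MATCHED_MEAN, AGE_MATCHED_SD = 0.85, 0.12  # age- and sex-matched reference gives the Z-score

measured_bmd = 0.70  # hypothetical measurement
t_score = standardized_score(measured_bmd, YOUNG_ADULT_MEAN, YOUNG_ADULT_SD)
z_score = standardized_score(measured_bmd, AGE_MATCHED_MEAN, AGE_MATCHED_SD)
print(f"T-score = {t_score:.2f}, Z-score = {z_score:.2f}")
```

In this made-up case the same measurement sits 2.5 standard deviations below the young-adult mean but only 1.25 below the age-matched mean, which is why the choice of reference matters at the extremes of age.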

Despite its diagnostic accuracy, the widespread use of DXA remains limited due to the high cost of equipment, restricted accessibility, and geographical disparities in healthcare infrastructure (7). Although quantitative computed tomography (QCT) and ultrasound bone densitometry are available as alternatives, they are constrained by high costs and reduced accuracy, respectively.

Rural and underserved populations often lack access to DXA scanners, resulting in delayed osteoporosis diagnosis and increased fracture risks. Additionally, DXA interpretation requires trained specialists, which can further limit its widespread adoption in primary care settings. Given these limitations, there is a growing demand for alternative screening methods that are cost-effective, widely available, and do not require extensive medical expertise for interpretation (8).

To address the limitations of existing diagnostic methods and enhance the early detection rates in asymptomatic patients, chest radiograph-based osteoporosis screening using deep learning technology has recently gained attention. Chest radiographs offer the potential to identify high-risk individuals for osteoporosis without additional radiation exposure. This diagnostic method shows significant potential and could be widely applied in various healthcare settings if clinically reliable interpretations can be secured, thereby serving as a valuable tool for early diagnosis (9,10).

The adoption of artificial intelligence (AI) in medical imaging has revolutionized disease detection across various domains, including oncology, cardiology, and endocrinology (11). AI-driven diagnostic models have demonstrated remarkable accuracy in identifying subtle radiological patterns that may not be immediately apparent to human observers (12). Deep learning models, particularly convolutional neural networks (CNNs), have been extensively studied for their ability to analyze medical images, outperforming traditional statistical models in many cases. Integrating AI into osteoporosis screening could significantly enhance diagnostic efficiency and facilitate large-scale population screening (11).

Several previous studies have demonstrated the excellent performance of deep learning models, particularly CNNs, through transfer learning and fine-tuning even with limited datasets. For example, Yamamoto et al. achieved an area under the curve (AUC) of 0.91 with a model combining hip radiographs and clinical covariates, while Tsai et al. recorded an area under the receiver operating characteristic curve (AU-ROC) of 0.93 in evaluating osteoporosis and mortality risk (12,13). However, He et al. found no significant performance difference between image-only models and those incorporating gender and age with VGG-16 and ResNet-34-based models (9). These findings highlight both the potential and limitations of deep learning-based osteoporosis screening models, emphasizing the need for further research to optimize these models.

Despite these advancements, challenges remain in translating AI-based osteoporosis screening into clinical practice. Model generalizability, interpretability, and integration into existing healthcare workflows require further investigation. The reliability of deep learning predictions is also influenced by dataset heterogeneity, imaging protocol variations, and the presence of comorbidities that may affect bone density estimation. Addressing these challenges is crucial to ensuring that AI-based screening models are both clinically reliable and ethically sound (8).

We aim to externally validate the diagnostic performance of the PROS® CXR: OSTEO program (Promedius, Inc., Seoul, South Korea) using retrospective data, in comparison with DXA-based T-scores. Unlike previous research, the PROS® CXR: OSTEO program was trained with consideration of gender and age differences in osteoporosis, which sets it apart from studies that used a single dataset to train a unified model. This approach is expected to offer more precise diagnostic performance. Additionally, the PROS® CXR: OSTEO program was optimized by applying transfer learning based on Inception-v3, pretrained on ImageNet, enabling it to maintain high diagnostic accuracy even with relatively small datasets. Through this comparative analysis, we aim to identify the most predictive variables among DXA measurement sites (L1–L4) and explore key considerations in applying deep learning to osteoporosis diagnosis. However, as this study is based on a retrospective design with a relatively limited dataset size, careful interpretation of the results is warranted, and further prospective validation is encouraged.

Furthermore, this study will explore the potential of integrating AI-driven osteoporosis screening with electronic health record (EHR) systems to facilitate automated risk stratification. The seamless incorporation of AI predictions into clinical decision-making could enhance early intervention strategies and reduce the burden of osteoporotic fractures. As AI-based screening tools continue to evolve, their potential for widespread implementation in routine medical practice warrants comprehensive evaluation (14).

Ultimately, this research aspires to develop a cost-effective and efficient screening method that enhances accessibility to early osteoporosis diagnosis while assessing its practical applicability within clinical settings. We present this article in accordance with the TRIPOD reporting checklist (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1077/rc).

Methods

Study design and participants

This study was designed as a retrospective observational study to evaluate the diagnostic performance of a deep learning-based osteoporosis prediction model using chest X-ray images. The study was conducted at a single institution and included a total of 80 participants (25 males and 55 females) who underwent DXA-based BMD measurements. All participants were recruited from the health screening center at Ewha Womans University Medical Center between January 2021 and December 2024. To minimize bias, chest X-ray images and DXA scans were acquired independently, and the PROS® CXR: OSTEO program generated predictions retrospectively without human intervention. Analysts evaluating outcomes were blinded to model predictions.

Clinical and imaging data were retrospectively collected from electronic medical records (EMRs) between November 11 and December 7, 2024. The validation dataset differed from the original development data in institutional setting, patient demographics, and time of data acquisition. The current study included all chest radiographs regardless of pulmonary lesions and did not exclude patients with chronic respiratory or metabolic comorbidities, thereby enhancing the generalizability assessment. Exclusion criteria were applied to ensure data quality and consistency between imaging and BMD measurements. Patients were excluded if they met any of the following conditions:

Chest X-ray and DXA measurements not performed within a 3-month interval;
Poor-quality chest X-ray images, including motion artefacts or inadequate exposure;
Age below 40 or above 80 years at the time of examination.

No additional exclusion criteria were applied in order to evaluate the generalizability of the model in real-world clinical settings. Accordingly, participants with pulmonary lesions (e.g., nodules, tuberculosis, or other small abnormalities) were included. There were no missing data for predictor or outcome variables.

Participants were categorized into three groups based on their T-scores obtained from DXA measurements:

Normal: T-score ≥−1.0;
Osteopenia: −1.0> T-score >−2.5;
Osteoporosis: T-score ≤−2.5.
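The three categories above can be written as a small function. This is a minimal sketch for illustration, not the code used in the study; a T-score of exactly −2.5 is assigned to osteoporosis, following the ≤−2.5 criterion, and the function name is our own.

```python
def classify_bone_health(t_score: float) -> str:
    """Map a DXA T-score to a bone health category."""
    if t_score <= -2.5:
        return "osteoporosis"
    if t_score < -1.0:
        return "osteopenia"
    return "normal"  # T-score >= -1.0


for example in (0.4, -1.0, -1.7, -2.5, -3.2):
    print(example, classify_bone_health(example))
```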

For premenopausal women, osteoporosis was assessed based on Z-scores. Baseline demographic information, including age, sex, body mass index (BMI), and BMD values, was collected and analyzed.

Imaging and model analysis

DXA measurement

BMD values were measured using a DXA scanner (Model: BHR-3-76/Prodigy Advance, GE Healthcare, Chicago, IL, USA; tube voltage 76 kVp, tube current 3 mA; manufactured in November 2018) at two primary anatomical sites:

Lumbar spine (L1–4): individual BMD values for each vertebra, together with their average and standard deviation;
Femur: average and standard deviation of the BMD values.

All DXA measurements adhered to the guidelines of the Korean Society for Bone and Mineral Research (KSBMR) and were conducted by experienced radiologic technologists.

Chest X-ray image acquisition

All chest X-ray images were acquired using the DigitalDiagnost system (Philips Healthcare, Amsterdam, The Netherlands). The images were obtained in the posteroanterior (PA) view and stored in digital imaging and communications in medicine (DICOM) format in November 2024. All PA chest X-rays were taken within three months of the patients’ BMD measurements.

The images were acquired by board-certified radiologists with the patients in the standing position during full inspiration. The technical parameters were standardized across all patients, with a tube voltage of 117.00 kV and a tube current ranging from 820 to 930 mA. No post-processing, such as contrast enhancement or resizing, was applied prior to analysis.

Evaluation of the deep learning model (PROS® CXR: OSTEO program)

The deep learning-based algorithm, the PROS® CXR: OSTEO program, was applied to chest X-ray images to predict osteoporosis risk. The model was developed using a CNN architecture based on Inception-v3. Transfer learning was implemented using a pre-trained model from the ImageNet dataset (14).

Inception-v3 consists of multiple “Inception modules” that enable the model to extract multi-scale features through parallel convolutional branches with different kernel sizes (e.g., 1×1, 3×3, 5×5), with 1×1 convolutions used for dimensionality reduction. The model also incorporates auxiliary classifiers to mitigate vanishing gradients and improve training stability.

Transfer learning was implemented by initializing the model with weights pre-trained on the ImageNet dataset, a large-scale benchmark dataset containing over 1 million labelled images from 1,000 classes. The pre-trained network was then fine-tuned using medical imaging data curated by Promedius. Specifically, only the final classification layers were retrained on chest X-ray images labelled with DXA-based T-scores, while earlier layers were kept frozen to retain general visual features.

The model was developed using a dataset of 9,825 training, 1,212 validation, and 1,989 internal test cases (mean age 57.04 years; age range 40–90 years; predominantly female). The proportions of diagnostic categories (osteoporosis, osteopenia, and normal) were balanced across all sets. X-rays with medical devices were excluded.

Two diagnostic thresholds were applied for classification:

Diagnosis 1: a PROS® CXR: OSTEO program score ≥0.2 was classified as osteopenia (T-score <−1.0);
Diagnosis 2: a PROS® CXR: OSTEO program score ≥0.2 was classified as osteoporosis (T-score <−2.5).

The PROS® CXR: OSTEO program employs a predefined threshold score of 0.2 as a categorical decision boundary, with values at or above this threshold indicating a high risk of osteoporosis. In this study, we evaluated whether this threshold effectively distinguishes only patients with osteoporosis or more broadly identifies individuals at risk, including those with osteopenia. To this end, we applied two diagnostic criteria and assessed the model’s performance by comparing its predictions with DXA-based T-score classifications. A confusion matrix was generated to quantitatively evaluate diagnostic accuracy.
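The comparison described above can be sketched as follows: the same model score is dichotomized at 0.2, and only the DXA reference cutoff changes between Diagnosis 1 and Diagnosis 2. The scores and T-scores below are hypothetical toy values, and the function is an illustration rather than the program's actual interface.

```python
SCORE_THRESHOLD = 0.2  # decision boundary reported for the program


def confusion_counts(scores, t_scores, t_cutoff, threshold=SCORE_THRESHOLD):
    """Count TP/FP/TN/FN when score >= threshold is a positive prediction
    and T-score < t_cutoff is the DXA reference positive."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, t_score in zip(scores, t_scores):
        predicted_positive = score >= threshold
        reference_positive = t_score < t_cutoff
        if predicted_positive and reference_positive:
            counts["TP"] += 1
        elif predicted_positive:
            counts["FP"] += 1
        elif reference_positive:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical toy data: six patients, model score and DXA T-score.
scores = [0.05, 0.35, 0.60, 0.10, 0.25, 0.80]
t_scores = [0.3, -1.4, -2.8, -1.2, -0.4, -3.1]

diagnosis_1 = confusion_counts(scores, t_scores, t_cutoff=-1.0)  # osteopenia or worse
diagnosis_2 = confusion_counts(scores, t_scores, t_cutoff=-2.5)  # osteoporosis only
print(diagnosis_1, diagnosis_2)
```

Even in this toy example, the stricter reference (Diagnosis 2) removes all false negatives at the cost of additional false positives, which is the trade-off examined in this study.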

It should be noted that this study was conducted as an external validation of the Promedius algorithm without any retraining or structural modifications. The primary objective was to assess whether the commercially developed PROS® CXR: OSTEO program model could maintain its diagnostic accuracy when applied to chest X-ray images collected at our institution. Accordingly, all deep learning-related parameters described in this manuscript, such as model architecture, transfer learning strategy, and training procedures, reflect those originally implemented by the developers during the model’s initial development. No alterations to the algorithm or additional training were performed as part of this study.

As the prediction model is a proprietary deep learning model built on the Inception-v3 architecture, specific model coefficients and intercepts are not applicable. The model functions as a binary classifier using a threshold-based risk score output.

Performance evaluation metrics

The predictive performance of the model was assessed using the following metrics:

Accuracy: [true positive (TP) + true negative (TN)]/[TP + TN + false-positive (FP) + false-negative (FN)];
Sensitivity (recall): TP/(TP + FN);
Specificity: TN/(TN + FP);
Precision: TP/(TP + FP);
F1-score: 2× (precision × recall)/(precision + recall);
AUC: evaluation of the model’s discriminative ability.
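As a worked illustration of the formulas above, the following sketch computes each metric from confusion matrix counts. The counts passed in are those reported in Table 3 for the L-spine average reference (TP =29, FP =9, TN =39, FN =3); the function itself is our own illustration and not the analysis code used in the study.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the listed metrics from confusion matrix counts.
    Assumes every denominator is non-zero."""
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1_score": f1_score,
    }


metrics = diagnostic_metrics(tp=29, fp=9, tn=39, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these values agree with the L-spine average, Diagnosis 1 column of Table 2.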

All statistical analyses were conducted with a 95% confidence interval (CI). The confusion matrices were generated, and key performance metrics, including sensitivity, specificity, precision, recall, F1 score, and AUC, were compared (15).
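For readers less familiar with the AUC, it can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison on hypothetical toy data; it illustrates the definition only and is not the procedure used to produce the reported AUC values.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting tied scores as half a correct pair."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical model scores and reference labels (1 = at risk on DXA).
scores = [0.05, 0.35, 0.60, 0.10, 0.25, 0.80]
labels = [0, 1, 1, 1, 0, 1]
auc = auc_from_scores(scores, labels)
print(f"AUC = {auc:.3f}")
```

Because the AUC depends only on the ranking of scores, it is unaffected by the choice of the 0.2 decision threshold, unlike sensitivity and specificity.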

Statistical analysis

Statistical analyses were performed using SPSS software (v.30) in December 2024. Normality of the data was assessed using the Shapiro-Wilk test. Differences between groups were analyzed using the following methods:

Independent t-test: used for continuous variables such as age, BMI, and BMD values;
Chi-squared test (or Fisher’s exact test): used for categorical variables such as sex;
Mann-Whitney U test: applied when variables did not follow a normal distribution;
Logistic regression analysis: used to evaluate risk factors associated with osteoporosis.

A significance level of P<0.05 was considered statistically significant.

To assess the agreement between AI-based osteoporosis diagnosis and actual DXA-based diagnosis, a Chi-squared (χ²) test was conducted.
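A minimal sketch of such a test on a 2×2 table is given below. It uses the counts listed in Table 3 under the real diagnosis reference (TP =30, FP =8, FN =6, TN =36) and assumes a Pearson χ² statistic without continuity correction; the statistic actually produced by SPSS is not reported here and may differ if a correction was applied.

```python
import math


def chi_squared_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-squared test (1 degree of freedom, no continuity
    correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: model positive / model negative; columns: DXA at risk / DXA normal.
chi2, p_value = chi_squared_2x2(a=30, b=8, c=6, d=36)
print(f"chi-squared = {chi2:.2f}, P = {p_value:.2e}")
```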

Additionally, statistical power analysis was performed using G*Power 3.1, based on a medium effect size (Cohen’s w =0.3), significance level (α =0.05), and total sample size (n=80). The resulting statistical power was 76.5%, slightly below the conventionally recommended threshold of 80%; the study is therefore modestly underpowered to detect medium-sized effects, and non-significant results should be interpreted with caution.
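The reported power can be reproduced approximately with the sketch below, assuming a χ² test with one degree of freedom (as for a 2×2 table); the degrees of freedom are our assumption and are not stated in the G*Power settings above. With one degree of freedom, the noncentral χ² statistic is the square of a normal variable with mean √n × w, so the power follows from the standard normal distribution.

```python
import math
from statistics import NormalDist


def chi2_power_df1(effect_size_w: float, n: int, alpha: float = 0.05) -> float:
    """Power of a chi-squared test with 1 degree of freedom.
    The test rejects when |Z| exceeds the two-sided critical value,
    where Z is normal with mean sqrt(n) * w under the alternative."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = math.sqrt(n) * effect_size_w
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


power = chi2_power_df1(effect_size_w=0.3, n=80)
print(f"power = {power:.3f}")
```

Under the same assumptions, raising the sample size to about 88 participants would bring the power to roughly 80%.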

Misclassification analysis

To investigate potential causes of misclassification, participants were divided into correctly classified and misclassified groups. The relationships between sex, age, BMI, and BMD values were analyzed. Additionally, predictive errors were examined to identify factors contributing to incorrect classification by the PROS® CXR: OSTEO program model.

Further subgroup analyses were conducted to compare the clinical characteristics of FP and FN cases. The findings from these analyses provide insights into potential model improvements.

Ethical considerations

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Review Board of Ewha Womans University Medical Center (No. SEUMC IRB 2025-01-011), and individual consent for this retrospective analysis was waived.

Results

Demographic and clinical characteristics

This study included a total of 80 participants, consisting of 25 males and 55 females. The average age was 62.16±7.11 years for males and 61.66±8.68 years for females, with no significant difference between the two groups (P=0.80) (Table 1). The average BMI was 23.87±3.54 kg/m² for males and 22.84±3.50 kg/m² for females, which also did not show a statistically significant difference (P=0.23).

Table 1

Baseline characteristics of the study population (comparison of demographic and clinical characteristics between male and female participants)

Characteristics	Male (n=25)	Female (n=55)	P
Age, years	62.16±7.11	61.66±8.68	0.80
BMI, kg/m²	23.87±3.54	22.84±3.50	0.23
Femur average BMD	0.64±1.03	−0.97±1.14	<0.001
L-spine average BMD	0.39±1.90	−0.94±1.46	0.001
Bone health
Normal	22 (88.0)	22 (40.0)	–
Osteopenia	2 (8.0)	21 (38.2)	–
Osteoporosis	1 (4.0)	12 (21.8)	–

Values are presented as mean ± standard deviation for continuous variables and as counts (percentages) for categorical variables. P values were calculated using the independent t-test for continuous variables and the Chi-squared test for categorical variables. Normal: T-score ≥−1.0. Osteopenia: −1.0> T-score >−2.5. Osteoporosis: T-score ≤−2.5. BMD, bone mineral density; BMI, body mass index.

BMD measurements revealed significant differences between males and females. The femur BMD was 0.64±1.03 in males and −0.97±1.14 in females (P<0.001), while the lumbar spine BMD was 0.39±1.90 in males and −0.94±1.46 in females (P=0.001). These results indicate that females generally had lower BMD compared to males.

Based on DXA results, the distribution of bone health categories differed markedly by gender. Among males, 88.0% were classified as normal, 8.0% had osteopenia, and 4.0% were diagnosed with osteoporosis. In contrast, among females, only 40.0% were classified as normal, while 38.2% had osteopenia, and 21.8% were diagnosed with osteoporosis. These findings suggest that osteoporosis and osteopenia were considerably more prevalent among females than males in this cohort.

When stratifying subjects into two groups based on the PROS® CXR: OSTEO program’s classification, namely a normal group and an at-risk group (osteopenia + osteoporosis), distinct differences were observed in age, BMI, and sex distribution.

The average age in the normal group was 59.0 years, compared to 65.2 years in the at-risk group (P<0.001). Mean BMI was 23.6 kg/m² in the normal group and 22.6 kg/m² in the at-risk group (P=0.24). Participant age ranged from 42 to 79 years, and BMI ranged from 15.045 to 32.908 kg/m².

Regarding sex distribution, the normal group included 22 males and 22 females, while the at-risk group comprised 3 males and 33 females.

Comparison of PROS® CXR: OSTEO program’s predictions and T-score-based diagnostic criteria (Diagnosis 1 and 2)

The accuracy, sensitivity, specificity, F1 score, and AUC for L-spine and femur BMD predictions under both criteria are summarized in Table 2. Classification outcomes (TP, TN, FP, FN) of the PROS® CXR: OSTEO program for osteoporosis diagnosis according to five different reference standards are summarized in Table 3. Figure 1 shows the ROC curves for osteoporosis diagnosis based on different reference standards.

Table 2

Diagnostic performance of the PROS® CXR: OSTEO program for osteoporosis classification under two diagnostic criteria

Reference standard	L-spine average		L-spine lowest		Femur average		Femur lowest		Real diagnosis
Metric	Diagnosis 1	Diagnosis 2	Diagnosis 1	Diagnosis 2	Diagnosis 1	Diagnosis 2	Diagnosis 1	Diagnosis 2	Diagnosis 1	Diagnosis 2
Accuracy	0.85	0.64	0.78	0.71	0.79	0.59	0.68	0.65	0.83	0.69
Sensitivity	0.91	1.00	0.71	0.86	0.86	1.00	0.60	0.92	0.83	1.00
Specificity	0.81	0.59	0.88	0.66	0.75	0.56	0.90	0.60	0.82	0.63
Precision	0.76	0.24	0.90	0.47	0.66	0.13	0.95	0.29	0.79	0.34
Recall	0.91	1.00	0.71	0.86	0.86	1.00	0.60	0.92	0.83	1.00
F1 score	0.83	0.38	0.79	0.61	0.75	0.23	0.74	0.44	0.81	0.51
AUC	0.93	0.93	0.90	0.95	0.94	0.90	0.83	0.88	0.91	0.93

Accuracy, sensitivity, specificity, precision, recall, F1 score, and AUC values are reported for each classification method. Diagnosis 1: PROS® CXR: OSTEO program score ≥0.2 classified as osteopenia (T-score <−1.0). Diagnosis 2: PROS® CXR: OSTEO program score ≥0.2 classified as osteoporosis (T-score <−2.5). AUC, area under curve; F1 score, harmonic mean of precision and recall.

Table 3

Classification outcomes (TP, TN, FP, FN) of the PROS® CXR: OSTEO program for osteoporosis diagnosis according to five different reference standards

Outcome	L-spine average	L-spine minimum	Femur average	Femur minimum	Real diagnosis
TP	29	34	25	36	30
FP	9	4	13	2	8
TN	39	28	38	24	36
FN	3	14	4	18	6

FN, false-negative; FP, false-positive; TN, true negative; TP, true positive.

Figure 1 ROC curves demonstrating the diagnostic performance of the PROS® CXR: OSTEO program in identifying osteoporosis based on different reference standards. The ROC curve in the figure was generated using Python with assistance from ChatGPT (OpenAI) for scripting and visualization. AI, artificial intelligence; AUC, area under the curve; ROC, receiver operating characteristic.

Under Diagnosis 1, with the L-spine average T-score as the reference standard, the model performed well, achieving a sensitivity of 0.91, specificity of 0.81, precision of 0.76, recall of 0.91, an F1 score of 0.83, and an AUC of 0.93. In contrast, under Diagnosis 2, the sensitivity reached 1.0; however, the specificity (0.59), precision (0.24), and F1 score (0.38) were markedly lower, indicating a higher FP rate. Based on these findings, Diagnosis 1 was determined to be the more balanced and reliable criterion. Therefore, further patient analysis was conducted using Diagnosis 1.

Analysis of features in correctly classified vs. misclassified groups

The demographic and clinical characteristics of correctly classified and misclassified cases under Diagnosis 1 are presented in Table 4.

Table 4

Baseline characteristics of correctly classified and misclassified cases by the PROS® CXR: OSTEO program

Characteristics	Correctly classified group (n=66)	Misclassified group (n=14)	P
Age, years	61.82±8.34	61.79±7.68	0.99
Sex (male)	20 (30.3)	5 (35.7)	0.76
BMI, kg/m²	23.15±3.31	23.21±4.56	0.96
Femur average	−0.45±1.43	−0.55±0.77	0.70
L-spine average	−0.56±1.84	−0.39±0.97	0.63

Categorical data are presented as numbers of participants with percentages in parentheses, while continuous data are expressed as means ± standard deviations. The Student’s t-test was used to compare continuous variables, and the Chi-squared (χ²) test was used to compare categorical variables. The P value is two-sided, and no adjustments were made for multiple comparisons. The values for femur average and L-spine average in the table represent the mean T-scores derived from DXA measurements. BMI, body mass index; DXA, dual-energy X-ray absorptiometry.

Analysis of FP and FN cases

Next, we analyzed the gender, age, and BMI of patients misclassified by the Promedius AI model. Among the 25 male participants, 5 were classified as FPs. In contrast, all 6 FN cases occurred in postmenopausal female participants. When analyzing the association between gender and misclassification, the P value for gender and FP was 0.10, while the P value for gender and FN was 0.17. The average age of the FP group was 64.25 years, whereas the average age of the FN group was 59.95 years. For age, the P values for comparisons of the correctly classified group (TN + TP) with the FN and FP groups were 0.495 and 0.55, respectively, while the P value between the FP and FN groups was 0.23. Additionally, the mean BMI for the FP group was 25.09 kg/m², compared to 23.15 kg/m² for the TN + TP group and 21.8 kg/m² for the FN group. The P values for BMI comparisons of the TN + TP group with the FN and FP groups were 0.42 and 0.28, respectively, while the P value between the FP and FN groups was 0.41.

Analysis of the clinical characteristics (such as prior surgeries or medical history) of the patients did not reveal any significant distinguishing features between the FP and FN groups. Notably, one male patient in the FP group had all T-scores below −1 and an L4 spine T-score below −2.5, yet was clinically diagnosed as normal.

Discussion

This study evaluated the performance of the deep learning-based osteoporosis screening model, the PROS® CXR: OSTEO program, in early osteoporosis diagnosis using chest X-ray images. The findings indicate that the PROS® CXR: OSTEO program demonstrates high diagnostic accuracy, particularly when applying a classification threshold of 0.2 for osteopenia (Diagnosis 1).

Performance evaluation of PROS® CXR: OSTEO program

The model’s diagnostic performance was assessed based on multiple criteria, including the lowest and average L-spine and femur T-scores, along with actual DXA-based diagnoses. Under Diagnosis 1, the model achieved a sensitivity of 0.91, specificity of 0.81, precision of 0.76, recall of 0.91, F1 score of 0.83, and AUC of 0.93, particularly when considering the L-spine average T-score. In contrast, Diagnosis 2 exhibited an increased sensitivity of 1.0 but suffered from lower specificity (0.59), precision (0.24), and F1 score (0.38), leading to a higher FP rate.

Importantly, our findings are consistent with those reported in prior work, particularly Jang et al. [2021] (14), which developed the OsPor-screen model using deep learning on chest radiographs and achieved an AUC of 0.91 for internal validation and 0.88 for external data. Our model demonstrated comparable or slightly higher diagnostic accuracy across internal testing metrics. By explicitly benchmarking our results against this established reference, we provide contextual evidence that the PROS® CXR: OSTEO program performs at a clinically meaningful level, aligned with state-of-the-art approaches in opportunistic osteoporosis screening using chest X-rays. This comparative validation highlights the robustness and real-world applicability of our model in routine clinical settings.

Given these findings, Diagnosis 1 is considered the more balanced and reliable criterion. The PROS® CXR: OSTEO program effectively identifies individuals at risk of osteoporosis rather than diagnosing the disease itself, making it a valuable pre-screening tool for early intervention. The model’s ability to screen high-risk patients before DXA confirmation enhances its clinical utility by improving accessibility to osteoporosis detection and optimizing resource allocation.

Additionally, under Diagnosis 1, the PROS® CXR: OSTEO program demonstrated a high AUC (≥0.90) for every reference standard except the femur’s lowest T-score (0.83), indicating strong diagnostic capability. The results suggest that deep learning-based screening could complement existing DXA-based diagnostics by providing an alternative, widely accessible pre-screening method (16). In this study, the evaluation of the model’s performance was not solely dependent on accuracy and AUC values. Instead, various metrics, including sensitivity, specificity, and F1-score, were analyzed to assess the diagnostic tendencies of the AI. The results demonstrated that both sensitivity and specificity were similarly high, indicating that the PROS® CXR: OSTEO program exhibited a balanced diagnostic performance without significant bias toward either FNs or FPs.

Analysis of misclassification trends

Despite its high overall accuracy, the PROS^® CXR: OSTEO program produced some misclassifications. To identify the underlying factors, we analyzed the demographic and clinical characteristics associated with FP and FN cases. Three trends emerged. First, regarding gender-related patterns, all FN cases (n=6) were postmenopausal women, while five of the eight FP cases were male. This suggests that differences in bone density loss patterns between genders may influence the model’s predictions. Second, the FN group had a higher mean age (64.25 years) than the correctly classified group (61.82 years) and the FP group (59.95 years). However, given the observed correlation between age and BMD decline (R=0.34), this factor requires careful interpretation. Lastly, FN cases had a higher mean BMI (25.09 kg/m²) than correctly classified cases (23.15 kg/m²) and FP cases (21.8 kg/m²). The presence of extreme BMI values in both FP and FN groups suggests that BMI may play a role in misclassification, highlighting the need for further investigation.
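A group-wise comparison of this kind can be sketched as follows. The record fields (`dxa`, `pred`, `age`, `bmi`), the helper names, and the four example records are hypothetical assumptions introduced for illustration; they are not the study's data.

```python
from collections import defaultdict
from statistics import mean


def outcome(reference: int, prediction: int) -> str:
    """Label a case as correct, FP, or FN (1 = osteoporosis, 0 = normal)."""
    if reference == prediction:
        return "correct"
    return "FP" if prediction == 1 else "FN"


def mean_by_outcome(records: list, field: str) -> dict:
    """Mean of one numeric field within each classification outcome group."""
    groups = defaultdict(list)
    for record in records:
        groups[outcome(record["dxa"], record["pred"])].append(record[field])
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical patient records for illustration only.
example_records = [
    {"dxa": 1, "pred": 1, "age": 60, "bmi": 23.0},
    {"dxa": 0, "pred": 0, "age": 62, "bmi": 24.0},
    {"dxa": 1, "pred": 0, "age": 66, "bmi": 26.0},  # missed case (FN)
    {"dxa": 0, "pred": 1, "age": 58, "bmi": 21.0},  # false alarm (FP)
]

age_means = mean_by_outcome(example_records, "age")
bmi_means = mean_by_outcome(example_records, "bmi")
print(age_means)
print(bmi_means)
```

With misclassified groups as small as those reported here, such group means are descriptive only and should not be read as evidence of a statistically significant difference.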

Among FP cases, one patient had all T-scores below −1 and an L4 T-score below −2.5, but was clinically diagnosed as normal. This individual was a male under 50 years old with a BMI of 15 kg/m², indicating that clinicians likely considered additional factors such as fracture risk before making a final diagnosis. These findings highlight potential limitations of using only T-scores for osteoporosis diagnosis and suggest that integrating additional clinical features (e.g., Z-scores, fracture history) into the PROS^® CXR: OSTEO program could enhance its predictive accuracy.

Implications for clinical application

The findings of this study highlight the potential of deep learning-based screening models in detecting osteoporosis. The PROS^® CXR: OSTEO program provides a rapid, accessible, and non-invasive method for identifying individuals at risk, which may be especially valuable in primary care settings and underserved regions where access to DXA is limited. However, this study was conducted as a preliminary validation using a relatively small dataset from a single center, which limits the generalizability of the findings.

To fully integrate the model into clinical practice, further validation using large-scale, multi-center datasets is essential. Such studies would help to confirm the reproducibility and robustness of the model across diverse populations, imaging settings, and healthcare environments. Until then, while the current results are promising, the clinical applicability of the PROS^® CXR: OSTEO program should be interpreted with caution and seen as a foundation for future development rather than a definitive solution.

Several directions for future work follow from these findings. First, refinement of diagnostic criteria is necessary, as misclassification patterns indicate that combining femur and L-spine BMD measurements could improve accuracy, especially for postmenopausal women. Future research should focus on optimizing threshold adjustments to minimize FN cases. Second, enhanced model training may improve performance by incorporating a more diverse dataset that includes larger populations with varying demographics and clinical characteristics. Third, integration with clinical decision support systems could further enhance the model’s utility. Embedding the PROS^® CXR: OSTEO program into electronic health records (EHRs) would enable automated risk stratification and support early intervention strategies. Finally, large-scale, multi-center validation studies are needed to assess the generalizability of the PROS^® CXR: OSTEO program and confirm its effectiveness in clinical applications.
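One way to frame the threshold adjustment mentioned above is to choose the highest decision threshold that still meets a target sensitivity, which keeps FN cases below a chosen limit while giving up as little specificity as possible. The following is a minimal sketch under the assumption that the model exposes a continuous risk score per image; the function names, the target value, and the example scores are hypothetical and do not describe the program's actual interface or the study's data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(
    scores: Sequence[float], labels: Sequence[int], min_sensitivity: float
) -> Tuple[float, float, float]:
    """Highest observed score usable as a threshold that meets the target.

    Sensitivity can only rise as the threshold falls, so the first
    candidate (scanning from high to low) that meets the target also
    has the best specificity among all candidates that meet it.
    """
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, labels, threshold)
        if sens >= min_sensitivity:
            return threshold, sens, spec
    raise ValueError("no threshold reaches the requested sensitivity")


# Hypothetical scores and labels for illustration only.
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
demo_labels = [1, 1, 0, 1, 1, 0, 0, 0]
chosen, sens, spec = pick_threshold(demo_scores, demo_labels, min_sensitivity=1.0)
print(f"threshold={chosen}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A threshold selected this way should be fixed on a development set and then checked on independent data, since tuning and evaluating on the same cases overstates performance.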

Limitation

This is a pilot study conducted at a single institution, using a sample from Ewha Womans University Medical Center. A total of 80 participants were included, which is sufficient to validate the performance of the PROS^® CXR: OSTEO program. However, the small number of FP (n=6) and FN (n=8) cases limits the ability to analyze common patterns among misclassified cases. To further evaluate the generalizability of the PROS^® CXR: OSTEO program and conduct a more detailed analysis of misclassification causes, future studies should incorporate external datasets or larger sample sizes (17,18). Additionally, further validation is needed with a larger cohort that includes a wider range of BMI and age distributions.

Since this study was conducted at a single institution and used data exclusively from a Korean population, there are limitations in applying the PROS^® CXR: OSTEO program to other demographic groups (19). In particular, additional research involving diverse ethnic and regional populations is necessary to assess the model’s applicability as an osteoporosis screening tool in regions with limited access to medical resources.

Furthermore, this study used X-ray images obtained under specific imaging settings and from particular radiographic equipment, which may affect the model’s generalizability (20). Future studies should evaluate whether the PROS^® CXR: OSTEO program maintains consistent performance under different imaging conditions, including the use of portable X-ray systems (10).

Conclusions

In summary, the PROS^® CXR: OSTEO program demonstrates strong potential as a deep learning-based osteoporosis screening tool. While it exhibits high diagnostic performance, particularly under Diagnosis 1, fewer misclassifications and greater dataset diversity are needed to enhance its reliability. By refining AI-driven screening models, we can advance early osteoporosis detection and improve accessibility to preventive care.

Acknowledgments

AI tools were used during the preparation of this manuscript. Specifically, ChatGPT (OpenAI) was employed to assist in generating Python code for ROC curve visualization and in improving the clarity and formatting of the text according to the Journal of Thoracic Disease submission guidelines. Additionally, Perplexity AI was used to search for and access relevant literature referenced in this study. All critical decisions regarding data interpretation and manuscript content were made by the authors.

Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1077/rc

Data Sharing Statement: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1077/dss

Peer Review File: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1077/prf

Funding: This research was supported by the Ewha Womans University Research Grant of 2024, Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. RS-2023-00240003), and by the Korean government (Ministry of Science and ICT) (No. RS-2025-25394457) and by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea government (Ministry of Trade, Industry and Energy) (P0030108, Industrial Technology Innovation Project - International Joint Technology Development Project - Strategic Technology Type - Global Demand-Linked Type).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1077/coif). All authors report funding from the Ewha Womans University Research Grant of 2024 and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. RS-2023-00240003). The authors have no other conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Review Board of Ewha Womans University Medical Center (No. SEUMC IRB 2025-01-011), and individual consent for this retrospective analysis was waived.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Curtis EM, van der Velde R, Moon RJ, et al. Epidemiology of fractures in the United Kingdom 1988-2012: Variation with age, sex, geography, ethnicity and socioeconomic status. Bone 2016;87:19-26.
2. Kanis JA, Johnell O, Oden A, et al. Long-term risk of osteoporotic fracture in Malmö. Osteoporos Int 2000;11:669-74.
3. Cummings SR, Cauley JA, Palermo L, et al. BMD and risk of hip and nonvertebral fractures in older men: a prospective study and comparison with older women. J Bone Miner Res 2006;21:1550-60.
4. Gourlay ML, Fine JP, Preisser JS, et al. Bone-density testing interval and transition to osteoporosis in older women. N Engl J Med 2012;366:225-33.
5. Banh K. Essentials of Osteoporosis: Early Prevention, Screening, and Management of this Silent Disease. 2022. Available online: https://scholarworks.arcadia.edu/showcase/2022/pa/64/
6. Prevention and management of osteoporosis. World Health Organ Tech Rep Ser 2003;921:1-164, back cover.
7. Mithal A, Bansal B, Kyer CS, et al. The Asia-Pacific Regional Audit-Epidemiology, Costs, and Burden of Osteoporosis in India 2013: A report of International Osteoporosis Foundation. Indian J Endocrinol Metab 2014;18:449-54.
8. El Maghraoui A, Roux C. DXA scanning in clinical practice. QJM 2008;101:605-17.
9. He Y, Lin J, Zhu S, et al. Deep learning in the radiologic diagnosis of osteoporosis: a literature review. J Int Med Res 2024;52:3000605241244754.
10. Park H, Kang WY, Woo OH, et al. Automated deep learning-based bone mineral density assessment for opportunistic osteoporosis screening using various CT protocols with multi-vendor scanners. Sci Rep 2024;14:25014.
11. Shaik A, Larsen K, Lane NE, et al. A staged approach using machine learning and uncertainty quantification to predict the risk of hip fracture. Bone Rep 2024;22:101805.
12. Tsai DJ, Lin C, Lin CS, et al. Artificial Intelligence-enabled Chest X-ray Classifies Osteoporosis and Identifies Mortality Risk. J Med Syst 2024;48:12.
13. Yamamoto N, Sukegawa S, Kitamura A, et al. Deep Learning for Osteoporosis Classification Using Hip Radiographs and Patient Clinical Covariates. Biomolecules 2020;10:1534.
14. Jang M, Kim M, Bae SJ, et al. Opportunistic Osteoporosis Screening Using Chest Radiographs With Deep Learning: Development and External Validation With a Cohort Dataset. J Bone Miner Res 2022;37:369-77.
15. Reitsma JB, Glas AS, Rutjes AW, et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005;58:982-90.
16. Ou Yang WY, Lai CC, Tsou MT, et al. Development of Machine Learning Models for Prediction of Osteoporosis from Clinical Health Examination Data. Int J Environ Res Public Health 2021;18:7635.
17. Bleeker SE, Moll HA, Steyerberg EW, et al. External validation is necessary in prediction research: a clinical example. J Clin Epidemiol 2003;56:826-32.
18. Sato Y, Yamamoto N, Inagaki N, et al. Deep Learning for Bone Mineral Density and T-Score Prediction from Chest X-rays: A Multicenter Study. Biomedicines 2022;10:2323.
19. Suh B, Yu H, Kim H, et al. Interpretable Deep-Learning Approaches for Osteoporosis Risk Screening and Individualized Feature Analysis Using Large Population-Based Data: Model Development and Performance Evaluation. J Med Internet Res 2023;25:e40179.
20. Rhee DJ, Kim S, Jeong DH, et al. Effects of the difference in tube voltage of the CT scanner on dose calculation. J Korean Phys Soc 2015;67:123-8.

Cite this article as: Koo B, Roh Y, Shin G, Yang Y, Lee R, Cho S, Ahn SH, Kim KC. Performance evaluation of deep learning-based osteoporosis diagnostic models with conventional chest X-ray in a clinical cohort. J Thorac Dis 2025;17(11):10127-10137. doi: 10.21037/jtd-2025-1077
