A systematic review of artificial intelligence-based diagnosis models for idiopathic pulmonary fibrosis

Ruo-Nan Yan; Hai-Yang Hu; Zi-Ru Ma; Wen-Han Zhao; Li-Li Xu; Dan-Yang Zang; Shu-Guang Yang; Xue-Qing Yu

doi:10.21037/jtd-2026-1-0307

Review Article

Ruo-Nan Yan¹,², Hai-Yang Hu¹,², Zi-Ru Ma¹,², Wen-Han Zhao¹,², Li-Li Xu¹,², Dan-Yang Zang¹, Shu-Guang Yang¹, Xue-Qing Yu¹

¹National Regional Traditional Chinese Medicine (Lung Disease) Diagnosis and Treatment Center, The First Affiliated Hospital of Henan University of CM, Zhengzhou, China; ²The First Clinical Medicine College, Henan University of Chinese Medicine, Zhengzhou, China

Contributions: (I) Conception and design: RN Yan, XQ Yu; (II) Administrative support: RN Yan, SG Yang; (III) Provision of study materials or patients: RN Yan, SG Yang; (IV) Collection and assembly of data: RN Yan, HY Hu, ZR Ma, WH Zhao; (V) Data analysis and interpretation: RN Yan, LL Xu, DY Zang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Prof. Xue-Qing Yu, MD. Chief Physician, National Regional Traditional Chinese Medicine (Lung Disease) Diagnosis and Treatment Center, The First Affiliated Hospital of Henan University of CM, No. 19 Renmin Road, Zhengzhou 450000, China. Email: yxqshi@163.com.

Background: Idiopathic pulmonary fibrosis (IPF) is a progressive, refractory disease of unknown etiology for which early and accurate diagnosis remains challenging. Artificial intelligence (AI) has shown promising results in IPF diagnosis and in predicting patient prognosis. However, comprehensive evidence on the diagnostic and predictive performance of these AI models is still lacking. This study therefore systematically reviewed and critically appraised AI-based diagnostic models for IPF, with the aim of informing future research.

Methods: A systematic search was conducted in the China National Knowledge Infrastructure (CNKI), Wanfang, China Science and Technology Journal Database (VIP), PubMed, The Cochrane Library, Web of Science, and Embase for relevant literature on IPF diagnosis models, covering the period from database inception to December 1, 2025. Three researchers independently screened the literature. Data were extracted according to the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS). The risk of bias and applicability of the models were assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST). The quality of model reporting was evaluated using the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) checklist.

Results: A total of 11 studies were included, all of which reported both model development and validation. Genes were the most common predictors included in the models. Regarding risk of bias, 3 studies were rated as high risk, with bias arising mainly from the outcome and analysis domains. Regarding applicability, 7 studies were rated as high concern and 4 as low concern, indicating limited clinical applicability of the models.

Conclusions: Predictive models for IPF diagnosis are still in the exploratory stage, showing good discrimination but a high overall risk of bias. Future research should optimize study design and improve reporting to support the development of clinically practical predictive models.

Keywords: Idiopathic pulmonary fibrosis (IPF); artificial intelligence (AI); diagnosis model; systematic review

Submitted Feb 01, 2026. Accepted for publication Apr 02, 2026. Published online Apr 23, 2026.

Highlight box

Key findings

• This systematic review comprehensively summarizes current artificial intelligence (AI)-based diagnostic models for idiopathic pulmonary fibrosis (IPF) and demonstrates that, although many models report excellent discriminative performance, the overall risk of bias is high and clinical applicability remains limited. Gene-based models showed the highest reported area under the curve values, while models based on computed tomography or routine clinical variables were generally more accessible but methodologically heterogeneous. Most models relied on internal or database-based validation, with insufficient calibration assessment and limited real-world generalizability.

What is known and what is new?

• Previous studies have shown that AI can achieve promising diagnostic accuracy for IPF, but there is significant heterogeneity and the reasons have not been systematically analyzed.

• This study conducted a comprehensive methodological assessment of AI-based diagnosis models using the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies, the Prediction Model Risk of Bias Assessment Tool, and the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis checklist. We systematically evaluated risk of bias and applicability. This approach provides a complementary perspective, revealing important methodological and translational limitations.

What is the implication, and what should change now?

• The findings indicate that high reported diagnostic accuracy does not necessarily translate into clinical reliability. Future research should focus on developing clinically accessible, well-validated models that can meaningfully support multidisciplinary diagnosis of IPF in real-world settings.

Introduction

Idiopathic pulmonary fibrosis (IPF) is a chronic, progressive fibrosing interstitial lung disease (ILD) of unknown etiology, clinically characterized by persistent dry cough and progressively worsening dyspnea (1). Epidemiological studies have shown that the incidence of IPF has been increasing annually, with a reported annual growth rate of up to 11% (2). The disease progresses rapidly, with a median survival of only 3–5 years (3), and carries an extremely poor prognosis, making it a major refractory disease that poses a serious threat to public health. Therapeutic options for IPF have expanded and now include antifibrotic agents such as pirfenidone and nintedanib, as well as supportive interventions such as oxygen therapy and pulmonary rehabilitation (4,5). These approaches can slow the decline in lung function and improve patients’ quality of life (6). However, their overall efficacy remains limited, and they cannot reverse disease progression. Beyond the limitations of existing treatments, the insidious onset and complex pathophysiological mechanisms of IPF make early identification and precise diagnosis difficult. These diagnostic difficulties have become critical bottlenecks that delay treatment and worsen patient prognosis, and research to address them is urgently needed.

In recent years, with the widespread application of artificial intelligence (AI) in healthcare, predictive models constructed using machine learning and related algorithms have enabled the integration of multidimensional data, including clinical indicators, imaging features, and biomarkers, thereby providing quantitative support for individualized assessment and facilitating early disease detection and prognostic evaluation (7). Although AI has demonstrated promising potential in the diagnosis and prognostic prediction of IPF, several limitations remain, including unclear model staging criteria and insufficient validation of stability and generalizability (8), which hinder its clinical translation. Therefore, this study aims to systematically review and critically appraise the existing evidence on AI-based diagnostic models for IPF, with a focus on methodological quality, reporting standards, and clinical applicability, while identifying key limitations affecting their real-world implementation. We present this article in accordance with the PRISMA reporting checklist (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2026-1-0307/rc).

Methods

Registration of the study

The study protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) (ID: CRD420261283078). Data extraction was guided by the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) (9). The risk of bias and applicability of the included studies were assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST) (10-12). The completeness and quality of model reporting were evaluated in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) checklist (13-15).

Search strategy

A comprehensive literature search was conducted in both Chinese and English databases, including China National Knowledge Infrastructure (CNKI), Wanfang Database, China Science and Technology Journal Database (VIP), PubMed, Web of Science, Embase, and The Cochrane Library, from database inception to December 1, 2025. A search strategy combining MeSH terms and free-text keywords was applied, including terms related to IPF (‘IPF’, ‘idiopathic pulmonary fibrosis’, ‘idiopathic pulmonary fibroses’) and AI (e.g., ‘artificial intelligence’, ‘deep learning’, ‘machine learning’, ‘neural networks’). No restrictions were applied regarding study type or country.
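
As a minimal sketch of how the two term groups above could be combined, the following Python snippet joins the IPF terms and the AI terms into a single Boolean string. The helper names `or_group` and `build_query` and the plain AND/OR syntax are illustrative assumptions; they do not reproduce the database-specific search strings used in this review, and field tags (for example, MeSH qualifiers) differ between databases.

```python
# Illustrative only: combines the term groups listed in the search strategy.
# The helper names and the generic AND/OR syntax are assumptions, not the
# database-specific search strings used by the authors.

IPF_TERMS = [
    "IPF",
    "idiopathic pulmonary fibrosis",
    "idiopathic pulmonary fibroses",
]
AI_TERMS = [
    "artificial intelligence",
    "deep learning",
    "machine learning",
    "neural networks",
]


def or_group(terms):
    """Quote each term and join the group with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(*groups):
    """Join several OR-groups with AND so that every concept must match."""
    return " AND ".join(or_group(group) for group in groups)


query = build_query(IPF_TERMS, AI_TERMS)
print(query)
```

In practice each database would need its own adaptation of such a string, for example by adding controlled-vocabulary terms where the database supports them.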

Inclusion and exclusion criteria

Inclusion criteria: (I) studies involving patients diagnosed with IPF; (II) studies focused on the construction or validation of models aimed at diagnosing, classifying, or identifying IPF.

Exclusion criteria: (I) non-IPF related diagnostic models; (II) models developed using conventional statistical methods only, without AI techniques; (III) conference abstracts, animal studies, reviews, editorials, or studies with unavailable full texts; (IV) studies that only investigated risk factors or predictors without constructing a diagnostic model; (V) studies focusing solely on radiologic usual interstitial pneumonia (UIP) pattern classification.

Study selection

Study selection and data extraction were conducted independently by three researchers (H.Y.H., Z.R.M., and W.H.Z.) to ensure accuracy and consistency. All identified records were imported into EndNote X9, and duplicates were removed. Titles and abstracts were screened first, and the full texts of potentially eligible studies were then read. Disagreements were resolved by arbitration with a fourth researcher (R.N.Y.).

Data extraction

Data extraction was performed according to the CHARMS checklist (9). The extracted content included basic study information (first author, publication year, country), study design, participants, candidate variables, sample size, model-building method, model performance (discrimination and calibration), and validation method (internal or external).

Risk of bias and applicability evaluation

Two reviewers (R.N.Y. and L.L.X.) independently assessed studies according to the PROBAST (10-12), with inconsistencies resolved by consensus or a third reviewer (D.Y.Z.).

The risk-of-bias assessment comprises 20 signaling questions grouped into four domains: participants, predictors, outcome, and analysis. Each signaling question can be answered as ‘yes’, ‘probably yes’, ‘no’, ‘probably no’ or ‘unclear’. A domain was judged as having low risk of bias if all signaling questions within that domain were answered as ‘yes’ or ‘probably yes’, and high risk of bias if any question was answered as ‘no’ or ‘probably no’. If no question was answered as ‘no’ or ‘probably no’ but at least one was answered as ‘unclear’, the risk of bias for that domain was rated as unclear. Overall risk of bias was considered low if all domains were judged as low risk; high if at least one domain was judged as high risk; and unclear if one or more domains were judged as unclear while all remaining domains were at low risk.

Applicability was evaluated across three domains: participants, predictors, and outcomes, with each domain rated as ‘low’, ‘high’, or ‘unclear’. Overall applicability was considered high if all domains were rated as low concern; low if any domain was rated as high concern; and unclear if at least one domain was rated as unclear while the remaining domains were rated as low concern.
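
The aggregation rules described above can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the PROBAST tool itself: the function names `rate_domain` and `rate_overall` are hypothetical, and the sketch assumes that a ‘high’ judgement takes precedence over an ‘unclear’ one.

```python
# Sketch of the rating rules described in the text. Function names are
# hypothetical, and "high" is assumed to take precedence over "unclear".

LOW, HIGH, UNCLEAR = "low", "high", "unclear"


def rate_domain(answers):
    """Rate one risk-of-bias domain from its signaling-question answers."""
    answers = [answer.lower() for answer in answers]
    if any(answer in ("no", "probably no") for answer in answers):
        return HIGH
    if any(answer == "unclear" for answer in answers):
        return UNCLEAR
    return LOW  # every answer is "yes" or "probably yes"


def rate_overall(domain_ratings):
    """Combine domain ratings into an overall judgement.

    The same rule serves for overall risk of bias (four domains) and for
    overall concern regarding applicability (three domains).
    """
    if HIGH in domain_ratings:
        return HIGH
    if UNCLEAR in domain_ratings:
        return UNCLEAR
    return LOW


# Hypothetical study: one unclear answer in the participants domain.
participants = rate_domain(["yes", "unclear"])
predictors = rate_domain(["yes", "probably yes", "yes"])
outcome = rate_domain(["probably yes", "yes"])
analysis = rate_domain(["yes", "yes", "probably yes"])
overall = rate_overall([participants, predictors, outcome, analysis])
print(participants, overall)
```

For applicability, a result of ‘low’ from `rate_overall` means low concern in every domain, which corresponds to high overall applicability in the wording used above.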

Report quality evaluation

The reporting quality of the included studies was independently assessed by two reviewers (R.N.Y. and D.Y.Z.) in accordance with the TRIPOD (13-15) checklist.

The TRIPOD checklist consists of six domains (title and abstract, introduction, methods, results, discussion, and other information) comprising a total of 22 signaling questions. Each signaling question was rated as ‘reported’, ‘partially reported’, or ‘not reported’. A domain was considered to have high reporting quality if all signaling questions within that domain were fully reported. A domain was rated as having moderate reporting quality if no signaling question was rated as ‘not reported’ but at least one question was rated as ‘partially reported’. A domain was considered to have low reporting quality if one or more signaling questions were rated as ‘not reported’. Overall reporting quality was judged as high if all domains were rated as high; moderate if no domain was rated as low and more than 50% of the domains were rated as moderate or high; and low if any single domain was rated as low.
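
The reporting-quality rules above can be written as a short sketch. The function names are hypothetical. Under the stated rules, once no domain is rated low, every domain is necessarily moderate or high, so the ‘more than 50%’ condition is always satisfied; the sketch therefore treats every remaining case as moderate.

```python
# Sketch of the TRIPOD reporting-quality rules described in the text.
# Function names are hypothetical.


def rate_tripod_domain(item_ratings):
    """Rate one TRIPOD domain from the ratings of its signaling questions."""
    if "not reported" in item_ratings:
        return "low"
    if "partially reported" in item_ratings:
        return "moderate"
    return "high"  # every item is fully "reported"


def rate_tripod_overall(domain_ratings):
    """Combine the six domain ratings into an overall reporting quality."""
    if "low" in domain_ratings:
        return "low"
    if all(rating == "high" for rating in domain_ratings):
        return "high"
    # No domain is low, so all domains are moderate or high (100% > 50%).
    return "moderate"


# Hypothetical study with one partially reported item in the methods domain.
methods = rate_tripod_domain(["reported", "partially reported", "reported"])
overall_quality = rate_tripod_overall(
    ["high", "high", methods, "high", "high", "high"]
)
print(methods, overall_quality)
```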

Results

Study selection

A total of 1,147 records were identified through the initial literature search. After removal of 474 duplicate records, 673 records remained. Following screening of titles and abstracts, 599 records were excluded for not meeting the inclusion criteria. The full texts of the remaining 74 articles were assessed for eligibility, and 63 articles were further excluded. Ultimately, 11 studies (16-26) were included in the final analysis. The literature screening process and results are presented in Figure 1.

Figure 1 PRISMA flowchart of literature search and selection. CNKI, China National Knowledge Infrastructure; VIP, China Science and Technology Journal Database.
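
The record counts reported above can be checked with simple arithmetic. The following sketch uses only the numbers stated in the text; the variable names are illustrative.

```python
# Consistency check of the screening counts reported in the text.
identified = 1147
duplicates_removed = 474
excluded_title_abstract = 599
excluded_full_text = 63

after_deduplication = identified - duplicates_removed
full_text_assessed = after_deduplication - excluded_title_abstract
included = full_text_assessed - excluded_full_text

print(after_deduplication, full_text_assessed, included)  # 673 74 11
```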

Study characteristics

A total of 11 studies published between 2022 and 2025 were included. Among them, 7 studies were conducted in China (16-22), 2 in the United States (23,24), 1 in Japan (25), and 1 in the Netherlands (26). Among the included studies, 7 (16-18,20-22,26) focused exclusively on IPF, while the remaining studies compared IPF with non-IPF ILD or connective tissue disease-associated ILD (CTD-ILD). In terms of data sources, gene expression profiles were the most commonly used predictors (16-18,20-22), followed by lung computed tomography (CT) imaging (19,23,26), clinical and laboratory variables (24), and histopathological images (25). The AI methods for building models included random forest (RF), least absolute shrinkage and selection operator (LASSO), support vector machines (SVM), convolutional neural networks (CNNs), etc. A quantitative meta-analysis was not performed due to substantial clinical and methodological heterogeneity across included models, including differences in predictors, outcome definitions, and validation strategies. The basic characteristics of the included studies are shown in Table 1.

Table 1

Basic characteristics of included studies

Author, year	Country	Population	Sample size, n	Study period	Data source	Model types
Yang, 2025 (16)	China	IPF	491	2010–2020	GEO database (GSE47460)	LASSO, Ridge, Enet, Stepglm, SVM, GBM, RF, XGBoost, GLM, LDA, plsRglm, Naive Bayes
Liu, 2025 (17)	China	IPF	145	–	GEO database (GSE47460)	LASSO, Ridge, Enet, Stepglm, SVM, GBM, RF, XGBoost, GLM, LDA, plsRglm, Naive Bayes
Guo, 2025 (18)	China	IPF	48	–	GEO database (GSE24206, GSE10667)	RF, LASSO
Du, 2025 (19)	China	IPF/CTD-ILD	154/189	January 2014–December 2022	Cohort study	SVM, RF, SGD, KNN, XGBoost, LightGBM
Wei, 2025 (20)	China	IPF	103	–	GEO database (GSE150910)	RF, LASSO
Zhang, 2024 (21)	China	IPF	119	–	GEO database (GSE110147, GSE21369, GSE24206)	RF
Yang, 2024 (22)	China	IPF	476	February 2024	GEO database (GSE128033, GSE150910, GSE32527, GSE110147)	LASSO, RF, XGBoost
Ahmad, 2024 (23)	United States	ILD	300	2025–2018	Registration and research	ML
Mueller, 2024 (24)	United States	IPF/CTD-ILD	25/28	June 2020–June 2021	Cohort study	RF
Teramoto, 2022 (25)	Japan	IPF/non-IPF ILD	12/12	–	Cohort study	GAN, CNN
Refaee, 2022 (26)	Netherlands	IPF	122	2011–2018	Cohort study	Boruta

CNN, Convolutional Neural Network; CTD-ILD, connective tissue disease-associated ILD; Enet, Elastic net; GAN, Generative Adversarial Network; GBM, Gradient Boosting Machine; GEO, Gene Expression Omnibus; GLM, Generalized Linear Model; ILD, interstitial lung disease; IPF, idiopathic pulmonary fibrosis; KNN, K-Nearest Neighbors; LASSO, least absolute shrinkage and selection operator; LDA, Linear Discriminant Analysis; LightGBM, Light Gradient Boosting Machine; ML, Machine Learning; Naive Bayes, Naive Bayes Classifier; non-IPF ILD, non-IPF interstitial lung disease; plsRglm, Partial Least Squares Regression with Generalized Linear Model; RF, Random Forest; Ridge, Ridge Regression; SGD, Stochastic Gradient Descent; Stepglm, Stepwise Generalized Linear Model; SVM, Support Vector Machine; XGBoost, Extreme Gradient Boosting.

Model construction and performance

The diagnostic performance of AI-based models for IPF varied across data modalities and modeling strategies. Gene-based models (16-18,20-22) generally demonstrated good to excellent discriminatory ability, although the reported area under the receiver operating characteristic curve (AUC) values ranged widely, from 0.6587 to 0.9986. The upper end of this range is higher than the performance of CT-based models (19,23,26), which showed moderate to high diagnostic accuracy, with validation AUC values ranging from 0.74 to 0.873. Models constructed using demographic, laboratory, and clinical variables (24) or histopathological images (25) demonstrated moderate diagnostic performance, with reported accuracy between 0.632 and 0.902. Gene expression-based models and the model incorporating demographic, laboratory, and clinical variables retained 3–9 final predictors. CT-based models predominantly relied on global or automatically extracted imaging features, with no explicit reporting of specific predictors. Regarding model presentation, 6 studies (19,21,22,24-26) presented performance using ROC curves. Other studies presented their models as scoring tables, summary ratings, or nomograms, which may enhance interpretability and potential clinical utility. The characteristics of model construction and performance are shown in Table 2.

Table 2

Model construction and performance characteristics

Author, year	Model types	Predictor type	Model performance	Validated	Model presentation format	Final predictors
Yang, 2025 (16)	LASSO, Ridge, Enet, Stepglm, SVM, GBM, RF, XGBoost, GLM, LDA, plsRglm, Naive Bayes	Gene	GSE47460: AUC =0.9977, sensitivity =0.972, specificity =0.988, precision =0.981, accuracy =0.981	Y	Summary rating	GABARAPL1, UNC13B, CHPT1, CPNE4, CHRM3, TTR, PSD3, CHRM2
Liu, 2025 (17)	LASSO, Ridge, Enet, Stepglm, SVM, GBM, RF, XGBoost, GLM, LDA, plsRglm, Naive Bayes	Gene	GSE47460: AUC =0.9986, sensitivity =0.972, specificity =0.994, precision =0.991, accuracy =0.985	Y	Summary rating	CAV1, KLF4, AGTR1
Guo, 2025 (18)	RF, LASSO	Gene	ASPN: AUC =0.94 (95% CI: 0.89−0.99); COMP: AUC =0.99 (95% CI: 0.98−1.00); GPX8: AUC =0.94 (95% CI: 0.87−1.00)	Y	Scoring table, nomogram	ASPN, COMP, GPX8
Du, 2025 (19)	SVM, RF, SGD, KNN, XGBoost, LightGBM	CT	SVM: AUC =0.847, RF: AUC =0.873, SGD: AUC =0.740, KNN: AUC =1.000, XGBoost: AUC =0.866, LightGBM: AUC =0.863	Y	ROC curve	CT
Wei, 2025 (20)	RF, LASSO	Gene	CDH3: AUC =0.949; CHRM3: AUC =0.952	Y	Scoring table, nomogram	NOS2, CDH3, COL17A1, CHRM3, ALPP, COL3A1, NCR1
Zhang, 2024 (21)	RF	Gene	AUC =0.936 (95% CI: 0.894−0.971)	Y	ROC curve	LRRC17, COMP, ASPN, POSTN, COL3A1, IL13RA2, CRTAC1, PEBP4, CA4
Yang, 2024 (22)	LASSO, RF, XGBoost	Gene	IER3: AUC =0.8313, KRT18: AUC =0.6587, RAB25: AUC =0.7277	Y	ROC curve	IER3, KRT18, RAB25
Ahmad, 2024 (23)	ML	Demographic information, medical history, medication use, clinical questionnaire, CT	Sensitivity =0.41 (95% CI: 0.303−0.523), specificity =0.866 (95% CI: 0.814−0.909)	Y	Subgroup prediction performance	CT
Mueller, 2024 (24)	RF	Age, race, gender, BMI, smoking history, comprehensive metabolic panel, complete blood count with differential	AUC =0.928 (95% CI: 0.920−0.938), accuracy =0.902	Y	ROC curve	ALB, ALT, Lymph%, Hb, EOS%, WBC, Mon%, NE%
Teramoto, 2022 (25)	GAN, CNN	Pathology of lung tissue	Sensitivity =0.658, specificity =0.554, accuracy =0.632	Y	ROC curve	Pathology
Refaee, 2022 (26)	Boruta	Pulmonary function, CT	Sensitivity =0.66, specificity =0.79, accuracy =0.70, AUC =0.82 (95% CI: 0.68−0.95)	Y	ROC curve	CT

ALB, albumin; ALT, Alanine Aminotransferase; AUC, area under the curve; BMI, body mass index; CI, confidence interval; CNN, Convolutional Neural Network; CT, computed tomography; Enet, Elastic net; EOS%, eosinophil percentage; GAN, Generative Adversarial Network; GBM, Gradient Boosting Machine; GLM, Generalized Linear Model; Hb, hemoglobin; KNN, K-Nearest Neighbors; LASSO, least absolute shrinkage and selection operator; LDA, Linear Discriminant Analysis; LightGBM, Light Gradient Boosting Machine; Lymph%, lymphocyte percentage; ML, machine learning; Mon%, monocyte percentage; Naive Bayes, Naive Bayes Classifier; NE%, neutrophil percentage; plsRglm, Partial Least Squares Regression with Generalized Linear Model; RF, Random Forest; Ridge, Ridge Regression; ROC, receiver operating characteristic curve; SGD, Stochastic Gradient Descent; Stepglm, Stepwise Generalized Linear Model; SVM, Support Vector Machine; WBC, white blood cell count; XGBoost, Extreme Gradient Boosting; Y, yes.
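
For readers less familiar with the metrics listed in Table 2, the following sketch shows how sensitivity, specificity, precision, accuracy, and AUC can be computed from true labels and model scores. The toy labels and scores are invented for illustration and are not data from any included study; the function names are hypothetical, and the 0.5 decision threshold is an assumption.

```python
# Toy illustration of the performance metrics reported in Table 2.
# Labels: 1 = IPF, 0 = non-IPF. All values below are invented.


def classification_metrics(y_true, y_pred):
    """Compute threshold-based metrics; assumes both classes are present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, auc_value)
```

Unlike the other four metrics, AUC does not depend on the chosen threshold, which is one reason why a model can show a high AUC while its sensitivity or specificity at a given cut-off is modest.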

Model validation methods and results

Among the 11 included studies, external validation was reported in 5 studies (16-18,20,26), while the remaining 6 relied on internal validation only. External validation was predominantly performed using independent publicly available datasets, with most studies adopting single-center or Gene Expression Omnibus (GEO) database validation rather than geographically or temporally distinct populations. In terms of discrimination, nearly all studies reported at least one performance metric, most frequently the AUC, which indicated moderate to excellent diagnostic performance across models. None of the included studies provided comprehensive calibration assessments such as calibration plots or goodness-of-fit statistics. This limited reporting precludes a robust evaluation of agreement between predicted and observed outcomes. In summary, most AI-based diagnostic models exhibited acceptable to excellent discrimination. However, substantial heterogeneity in validation design and inadequate calibration reporting remain major methodological limitations (Table 3).

Table 3

Model validation methods and results

Author, year	Validation method	Validation data source	Discrimination
Yang, 2025 (16)	External validation	GEO database (GSE32537, GSE53845, GSE150910)	GSE32537: AUC =0.9680, sensitivity =0.942, specificity =0.883, precision =0.89, accuracy =0.913; GSE53845: AUC =0.9795, sensitivity =0.9, specificity =0.966, precision =0.918, accuracy =0.947; GSE150910: AUC =1, sensitivity =1, specificity =1, precision =1, accuracy =1
Liu, 2025 (17)	External validation	GEO database (GSE32537, GSE53845, GSE150910)	GSE32537: AUC =0.9576, sensitivity =0.942, specificity =0.913, precision =0.915, accuracy =0.927; GSE53845: AUC =0.979, sensitivity =0.88, specificity =0.958, precision =0.898, accuracy =0.935; GSE150910: AUC =0.9938, sensitivity =1, specificity =0.95, precision =0.8, accuracy =0.958
Guo, 2025 (18)	External validation	GEO database (GSE32537)	AUC =0.880
Du, 2025 (19)	Internal validation	Cohort study	SVM: AUC =0.792; RF: AUC =0.739; SGD: AUC =0.714; KNN: AUC =0.717; XGBoost: AUC =0.718; LightGBM: AUC =0.737
Wei, 2025 (20)	External validation	GEO database (GSE10667)	AUC =0.968 (95% CI: 0.923–1.000)
Zhang, 2024 (21)	Internal validation	GEO database (GSE110147, GSE21369, GSE24206)	AUC =0.936 (95% CI: 0.894–0.971)
Yang, 2024 (22)	Internal validation	GEO database (GSE70866)	AUC =0.8527
Ahmad, 2024 (23)	Internal validation	Registration and research	Sensitivity =0.41 (95% CI: 0.303–0.523), specificity =0.866 (95% CI: 0.814–0.909)
Mueller, 2024 (24)	Internal validation	Cohort study	AUC =0.893, accuracy =0.921
Teramoto, 2022 (25)	Internal validation	Cohort study	AUC =0.843
Refaee, 2022 (26)	External validation	Publicly available LTRC and RIA datasets	P=0.32

AUC, area under the curve; CI, confidence interval; KNN, K-Nearest Neighbors; LTRC, Lung Tissue Research Consortium; RF, Random Forest; RIA, Radiomics Imaging Archive; SGD, Stochastic Gradient Descent; SVM, Support Vector Machine.
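
Because none of the included studies reported calibration, a brief sketch may help to show what such an assessment involves. The code below computes a Brier score and a binned comparison of mean predicted probability against observed event rate, which is the information a calibration plot displays. The data are invented, the function names are hypothetical, and equal-width binning is only one of several possible choices.

```python
# Minimal calibration sketch on invented data (1 = IPF, 0 = non-IPF).


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def calibration_table(y_true, probs, n_bins=5):
    """Group predictions into equal-width probability bins and return, for
    each non-empty bin, (n, mean predicted probability, observed rate)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in last bin
        bins[index].append((y, p))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(p for _, p in members) / len(members)
        observed_rate = sum(y for y, _ in members) / len(members)
        rows.append((len(members), mean_predicted, observed_rate))
    return rows


y_true = [0, 0, 1, 1]
probs = [0.1, 0.3, 0.7, 0.9]

score = brier_score(y_true, probs)
rows = calibration_table(y_true, probs, n_bins=2)
for n, mean_predicted, observed_rate in rows:
    print(n, round(mean_predicted, 3), round(observed_rate, 3))
```

In a well-calibrated model the mean predicted probability and the observed rate are close in every bin; a model can discriminate well, as in this toy example, and still show gaps between the two.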

Risk of bias

In the participants domain, 5 studies (16,19,23-25) were rated as low risk of bias and 6 studies (17,18,20-22,26) as unclear, mainly because the inclusion and exclusion criteria for participants were not clearly described. In the outcome domain, 6 studies (16-18,20,21,25) were rated as unclear; most of these analyzed public data from the GEO database, for which it is unclear whether the outcome was defined in the same way for all participants. One study (19) was rated as high risk because its CT images were acquired over a long time span and on scanners of different models with different parameters. The remaining 4 studies (22-24,26) were rated as low risk. In the analysis domain, 6 studies (17-19,22,25,26) were rated as unclear, mainly because the consistency between the predictors and their weights and the reported results could not be confirmed. Two studies (20,21) were rated as high risk because of the small sample sizes used for model construction. Overall, according to the ratings in Table 4, 3 studies (19-21) were rated as high risk of bias, 6 studies (16-18,22,25,26) as unclear, and 2 studies (23,24) as low. The domain-level ratings for all models are shown in Table 4.

Table 4

PROBAST risk of bias and applicability evaluation results

Author, year	Risk of bias				Applicability			Overall
Author, year	Participants	Predictors	Outcome	Analysis	Participants	Predictors	Outcome	Risk of bias	Applicability
Yang, 2025 (16)	+	+	?	+	−	−	+	?	−
Liu, 2025 (17)	?	+	?	?	−	−	+	?	−
Guo, 2025 (18)	?	+	?	?	+	−	+	?	−
Du, 2025 (19)	+	+	−	?	+	+	+	−	+
Wei, 2025 (20)	?	+	?	−	−	−	+	−	−
Zhang, 2024 (21)	?	+	?	−	+	−	+	−	−
Yang, 2024 (22)	?	+	+	?	−	−	+	?	−
Ahmad, 2024 (23)	+	+	+	+	+	+	+	+	+
Mueller, 2024 (24)	+	+	+	+	+	+	+	+	+
Teramoto, 2022 (25)	+	+	?	?	+	−	+	?	−
Refaee, 2022 (26)	?	+	+	?	+	+	+	?	+

+, low risk of bias or low concern regarding applicability; −, high risk of bias or high concern regarding applicability; ?, unclear. PROBAST, Prediction Model Risk of Bias Assessment Tool.

Applicability evaluation

A total of 7 studies (16-18,20-22,25) were rated as having high concern regarding applicability and 4 studies (19,23,24,26) as having low concern, indicating that most models have limited applicability in practice. In the participants domain, the main reason for high concern was that the populations used for model development or validation were drawn from public databases, whereas the population of interest in this systematic review is hospital patients, so the two populations may differ. In the predictors domain, the main source of high concern was the difficulty of obtaining gene-related predictors in routine clinical practice.

Report quality evaluation

According to the TRIPOD assessment, the reporting quality of the IPF diagnostic models was moderate. Reporting was most complete in the Introduction domain and least complete in the Methods domain. Several key items (6a-c, 7a-e, 11, 12, 14a, 16a-e) showed substantial reporting deficiencies. Detailed reporting performance across domains is presented in Figure 2.

Figure 2 Reporting quality for each TRIPOD signaling question. TRIPOD, Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis.

Discussion

In this study, we systematically summarized the current evidence on the application of AI in the diagnosis of IPF. Overall, the included studies demonstrated promising diagnostic performance, with several reporting AUC values exceeding 0.90, particularly for models based on genetic data and lung CT. However, we observed substantial discrepancies between model performance reported in the training datasets and that observed in internal validation cohorts, indicating a high risk of overfitting. In contrast, studies that performed external validation generally reported more conservative and potentially more reliable diagnostic performance. Nevertheless, methodological limitations, such as the use of internal validation alone, and heterogeneity stemming from inconsistent data sources and quality across studies hinder the clinical translation of these models. At present, AI-based diagnostic models for IPF remain largely in an exploratory stage and are not yet ready for routine clinical implementation.

The pathogenesis of IPF has not yet been fully elucidated, and its diagnosis continues to pose considerable challenges. Existing evidence suggests that genetic susceptibility, environmental exposures, and dysregulation of epigenetic modifications jointly contribute to abnormal lung tissue repair and the development of pulmonary fibrosis (27,28). Among these factors, gene mutations involved in telomere maintenance, epithelial barrier integrity, and immune regulation are considered important contributors to IPF pathogenesis (29,30). This biological basis provides theoretical support for the development of diagnostic models incorporating molecular and genetic information. On high-resolution CT, patients with IPF may present with a typical UIP pattern or a probable UIP pattern; however, identifying UIP patterns in clinical practice remains challenging (31). Diagnostic accuracy is highly dependent on the experience of radiologists or the diagnostic decisions made through multidisciplinary discussion (MDD) (32). Although international guidelines (33) strongly recommend MDD, it is time-consuming and resource-intensive, which limits its accessibility and reproducibility in primary or non-specialist hospitals (34). These diagnostic challenges have prompted researchers to explore more objective, reproducible, and scalable auxiliary diagnostic tools for IPF. In recent years, AI has shown substantial potential in disease diagnosis and prognostic assessment due to its ability to process and integrate high-dimensional, heterogeneous clinical data (35-37).

A notable finding of this review is that gene-related predictors are the most frequently incorporated variables in AI-based diagnostic models for IPF, and that most of these models exhibit good discriminative performance. This phenomenon may be explained by the central biological role of genetic susceptibility in IPF pathogenesis, as well as the capacity of AI to capture complex nonlinear relationships within high-dimensional datasets. In the field of CT imaging, our findings are broadly consistent with previous systematic reviews of AI-based IPF diagnostic models (38). A recent systematic review by Cong et al. (39) has quantitatively summarized the diagnostic accuracy of machine learning models for IPF, primarily focusing on indicators such as sensitivity and specificity. Unlike previous studies, we did not focus on the aggregated diagnostic performance, but systematically evaluated the bias risk and clinical applicability of AI-based diagnostic models. By applying the CHARMS checklist, PROBAST, and TRIPOD reporting standards, our review provides a more granular assessment of the strengths and limitations of existing models, thereby highlighting key methodological and translational challenges that are not fully captured by diagnostic accuracy metrics alone.

However, several limitations should be acknowledged. The number of studies included in this review was limited, and substantial heterogeneity was observed in model performance and reporting quality, so a quantitative meta-analysis was not performed. Although many models demonstrated excellent discriminative ability, this alone does not guarantee reliability in real-world clinical settings. Inadequate model validation was a major methodological limitation of the included studies. Most included studies developed and validated models using public databases (e.g., GEO), rather than clinical cohorts collected across different time periods or geographic regions. Such validation strategies may lead to overestimation of model performance (40) and represent a major source of the elevated risk of bias identified in this review. Furthermore, comprehensive calibration assessments were rarely performed, limiting evaluation of agreement between predicted and observed outcomes. These issues substantially constrain the generalizability of existing models and may explain the performance decline observed when models were tested outside their development datasets. The PROBAST tool indicates that, to reduce the risk of overfitting in prediction model development, the number of outcome events should be at least 20 times the number of candidate predictors, corresponding to an events-per-variable (EPV) value of at least 20 (41). In the included studies, small sample sizes and extensive feature selection procedures may also have contributed to optimistic performance estimates, raising concerns regarding overfitting. Beyond these methodological issues, gene expression data are not routinely available in clinical practice, which further limits the scalability and implementation of these models.
In contrast, models based on demographic characteristics, clinical information, lung CT imaging, or routine laboratory indicators, although generally exhibiting slightly lower discriminative performance than gene-based models, have lower data acquisition costs and greater clinical accessibility. With optimized study design, robust external validation, and improved reporting standards, these models may be more suitable for clinical implementation.
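As a rough illustration of the events-per-variable criterion discussed above, the following Python sketch computes EPV for a model development dataset and the minimum number of outcome events implied by a given threshold. The function names and the example figures (120 events, 15 candidate predictors) are hypothetical and are not drawn from any included study; the default threshold of 20 follows the PROBAST guidance cited in the text.

```python
import math


def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    """Return the ratio of outcome events to candidate predictors (EPV)."""
    if n_candidate_predictors <= 0:
        raise ValueError("n_candidate_predictors must be positive")
    if n_events < 0:
        raise ValueError("n_events must be non-negative")
    return n_events / n_candidate_predictors


def min_events_needed(n_candidate_predictors: int, epv_threshold: float = 20.0) -> int:
    """Return the smallest number of events that reaches the EPV threshold."""
    if n_candidate_predictors <= 0:
        raise ValueError("n_candidate_predictors must be positive")
    return math.ceil(epv_threshold * n_candidate_predictors)


def meets_epv_threshold(n_events: int, n_candidate_predictors: int,
                        epv_threshold: float = 20.0) -> bool:
    """Check whether a dataset reaches the chosen EPV threshold."""
    return events_per_variable(n_events, n_candidate_predictors) >= epv_threshold


if __name__ == "__main__":
    # Hypothetical dataset: 120 outcome events, 15 candidate predictors.
    epv = events_per_variable(120, 15)
    print(f"EPV = {epv:.1f}")
    print(f"Events needed for EPV >= 20: {min_events_needed(15)}")
    print(f"Threshold met: {meets_epv_threshold(120, 15)}")
```

Note that the count of candidate predictors should reflect all variables considered before any feature selection, not only those retained in the final model; with transcriptomic data this number can be very large, which is one reason low EPV values are plausible in gene-based models.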

Therefore, future AI-based diagnostic models for IPF should, in addition to pursuing algorithmic performance, place greater emphasis on real-world clinical needs. This includes adopting standardized outcome definitions, conducting prospective multicenter studies for external validation, and ensuring adequate sample sizes to improve model reliability and applicability. Furthermore, comprehensive calibration assessments, including calibration plots and goodness-of-fit statistics, should be reported to evaluate the agreement between predicted and observed outcomes. Adherence to TRIPOD-AI or related reporting guidelines is also essential to enhance transparency, reproducibility, and interpretability of future studies.
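To make the calibration recommendation more concrete, the sketch below shows one simple way to summarize agreement between predicted and observed outcomes using only the Python standard library. It is a minimal illustration under stated assumptions rather than a prescribed method: the function names are hypothetical, the toy data are invented, and the goodness-of-fit statistic is a Hosmer-Lemeshow-type statistic computed over equal-width probability bins (the conventional version groups by quantiles of predicted risk).

```python
from typing import List, Sequence, Tuple


def calibration_bins(y_true: Sequence[int], y_prob: Sequence[float],
                     n_bins: int = 10) -> List[Tuple[float, float, int]]:
    """Group predictions into equal-width probability bins.

    Returns (mean predicted probability, observed event fraction, count)
    for each non-empty bin; these are the points of a calibration plot.
    """
    totals = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum_prob, sum_events, n]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        totals[idx][0] += p
        totals[idx][1] += y
        totals[idx][2] += 1
    return [(s / n, e / n, n) for s, e, n in totals if n > 0]


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def hosmer_lemeshow_type_statistic(y_true: Sequence[int], y_prob: Sequence[float],
                                   n_bins: int = 10) -> float:
    """Sum over bins of (observed - expected)^2 / (n * p_bar * (1 - p_bar)).

    Bins whose mean predicted probability is exactly 0 or 1 are skipped
    because the denominator is zero there.
    """
    stat = 0.0
    for mean_pred, obs_frac, n in calibration_bins(y_true, y_prob, n_bins):
        variance = n * mean_pred * (1.0 - mean_pred)
        if variance > 0:
            stat += (obs_frac * n - mean_pred * n) ** 2 / variance
    return stat


if __name__ == "__main__":
    # Invented toy data: 1 = IPF, 0 = non-IPF, with predicted probabilities.
    outcomes = [0, 0, 1, 1]
    predicted = [0.1, 0.3, 0.7, 0.9]
    for mean_pred, obs_frac, n in calibration_bins(outcomes, predicted, n_bins=2):
        print(f"predicted={mean_pred:.2f} observed={obs_frac:.2f} n={n}")
    print(f"Brier score: {brier_score(outcomes, predicted):.3f}")
    print(f"HL-type statistic: {hosmer_lemeshow_type_statistic(outcomes, predicted, 2):.3f}")
```

Plotting the mean predicted probability against the observed fraction for each bin yields a basic calibration plot; points close to the diagonal indicate good agreement. In practice these quantities should be computed on external validation data rather than on the development set.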

Conclusions

AI models demonstrate promising diagnostic potential for IPF; however, the current evidence base is limited by high risk of bias, heterogeneity of data sources, and insufficient external validation. At present, these models should be considered investigational rather than clinically actionable. Future high-quality studies are required before AI-based diagnostic tools can be reliably implemented in routine IPF care.

Acknowledgments

None.

Footnote

Reporting Checklist: The authors have completed the PRISMA reporting checklist. Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2026-1-0307/rc

Peer Review File: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2026-1-0307/prf

Funding: This work was supported by the National Science and Technology Major Project of China (Nos. 2024ZD0522900 and 2024ZD0522901); National Natural Science Foundation of China (No. 82505809); Natural Science Foundation of Henan Province (No. 2022JDZX118); and Henan Provincial Key R&D Program Joint Fund (No. 242301420020).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2026-1-0307/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Fan Y, Bao H, Pimple P, et al. Incidence rate and prevalence of idiopathic pulmonary fibrosis in the United States 2017-2022. Ann Am Thorac Soc 2026;23:208-18.
2. Gonnelli F, Eleangovan N, Smith U, et al. Incidence and survival of interstitial lung diseases in the UK in 2010-2019. ERJ Open Res 2025;11:00823-2024.
3. Nick P, Studnicka M, Gregor J, et al. Hospitalisations in patients with idiopathic pulmonary fibrosis: insights from the IPF-PRO Registry and EMPIRE Registry. ERJ Open Res 2026;12:00402-2025.
4. Kritikou S, Zafeiridis A, Markopoulou A, et al. Long-Term Pulmonary Rehabilitation Enhances Cerebral Oxygenation, Functional Capacity, and Psychological Health in Idiopathic Pulmonary Fibrosis. Med Sci Sports Exerc 2026;58:650-60.
5. Kaenmuang P, Yip WH, Kahai R, et al. Non-pharmacological Management of Fibrosing Interstitial Lung Diseases. Tuberc Respir Dis (Seoul) 2026;89:166-83.
6. Hozumi H, Miyashita K, Nakatani E, et al. Antifibrotics and mortality in idiopathic pulmonary fibrosis: external validity and avoidance of immortal time bias. Respir Res 2024;25:293.
7. Chong PL, Vaigeshwari V, Mohammed Reyasudin BK, et al. Integrating artificial intelligence in healthcare: applications, challenges, and future directions. Future Sci OA 2025;11:2527505.
8. Wang YY, Wu HW. Research progress of artificial intelligence in idiopathic pulmonary fibrosis. Int J Med Radiol 2024;47:48-52.
9. Moons KG, de Groot JA, Bouwmeester W, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med 2014;11:e1001744.
10. Wolff RF, Moons KGM, Riley RD, et al. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies. Ann Intern Med 2019;170:51-8.
11. Moons KGM, Wolff RF, Riley RD, et al. PROBAST: A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies: Explanation and Elaboration. Ann Intern Med 2019;170:W1-W33.
12. Moons KGM, Damen JAA, Kaul T, et al. PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ 2025;388:e082505.
13. Collins GS, Reitsma JB, Altman DG, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.
14. Moons KG, Altman DG, Reitsma JB, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med 2015;162:W1-73.
15. Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024;385:e078378.
16. Yang L, Liu L, Liao Y, et al. Identification of diagnostic and prognostic phospholipid biomarkers in idiopathic pulmonary fibrosis via machine learning and in vivo validation. Hum Genomics 2025;19:144.
17. Liu X, Song J, Guo S, et al. Machine learning identifies lipid-associated genes and constructs diagnostic and prognostic models for idiopathic pulmonary fibrosis. Orphanet J Rare Dis 2025;20:354.
18. Guo Y, Jin Q, Kang Y, et al. Integrating machine learning and neural networks for new diagnostic approaches to idiopathic pulmonary fibrosis and immune infiltration research. PLoS One 2025;20:e0320242.
19. Du DF. Study on disease differentiation and prognosis prediction of IPF and CTD-ILD based on chest CT radiomic features. Lanzhou: Gansu University of Chinese Medicine; 2025.
20. Wei D, Jin FG, Gao YH. Molecular mechanism and diagnostic application of immunogenic cell death in idiopathic pulmonary fibrosis. J Air Force Med Univ 2025;46:739-45,54.
21. Zhang H, Hua H, Wang C, et al. Construction of an artificial neural network diagnostic model and investigation of immune cell infiltration characteristics for idiopathic pulmonary fibrosis. BMC Pulm Med 2024;24:458.
22. Yang Z, Yang Y, Han X, et al. Novel AT2 Cell Subpopulations and Diagnostic Biomarkers in IPF: Integrating Machine Learning with Single-Cell Analysis. Int J Mol Sci 2024;25:7754.
23. Ahmad Y, Mooney J, Allen IE, et al. A Machine Learning System to Indicate Diagnosis of Idiopathic Pulmonary Fibrosis Non-Invasively in Challenging Cases. Diagnostics (Basel) 2024;14:830.
24. Mueller AN, Miller HA, Taylor MJ, et al. Identification of Idiopathic Pulmonary Fibrosis and Prediction of Disease Severity via Machine Learning Analysis of Comprehensive Metabolic Panel and Complete Blood Count Data. Lung 2024;202:139-50.
25. Teramoto A, Tsukamoto T, Michiba A, et al. Automated Classification of Idiopathic Pulmonary Fibrosis in Pathological Images Using Convolutional Neural Network and Generative Adversarial Networks. Diagnostics (Basel) 2022;12:3195.
26. Refaee T, Bondue B, Van Simaeys G, et al. A Handcrafted Radiomics-Based Model for the Diagnosis of Usual Interstitial Pneumonia in Patients with Idiopathic Pulmonary Fibrosis. J Pers Med 2022;12:373.
27. Hu W, Xu Y. Transcriptomics in idiopathic pulmonary fibrosis unveiled: a new perspective from differentially expressed genes to therapeutic targets. Front Immunol 2024;15:1375171.
28. Zang N, Wu Y, Li P, et al. Viral Pathogens and Pulmonary Fibrosis: EMT-Driven Mechanisms and Insights From Traditional Chinese Medicine. Rev Med Virol 2026;36:e70118.
29. Lv H, Qian X, Tao Z, et al. HOXA5-induced lncRNA DNM3OS promotes human embryo lung fibroblast fibrosis via recruiting EZH2 to epigenetically suppress TSC2 expression. J Thorac Dis 2024;16:1234-46.
30. Tu M, Wei T, Jia Y, et al. Molecular mechanisms of alveolar epithelial cell senescence and idiopathic pulmonary fibrosis: a narrative review. J Thorac Dis 2023;15:186-203.
31. Chae KJ, Hwang HJ, Duarte Achcar R, et al. Central Role of CT in Management of Pulmonary Fibrosis. Radiographics 2024;44:e230165.
32. Comes A, Sgalla G, Ielo S, et al. Challenges in the diagnosis of idiopathic pulmonary fibrosis: the importance of a multidisciplinary approach. Expert Rev Respir Med 2023;17:1-11.
33. Raghu G, Remy-Jardin M, Richeldi L, et al. Idiopathic Pulmonary Fibrosis (an Update) and Progressive Pulmonary Fibrosis in Adults: An Official ATS/ERS/JRS/ALAT Clinical Practice Guideline. Am J Respir Crit Care Med 2022;205:e18-47.
34. Richeldi L, Launders N, Martinez F, et al. The characterisation of interstitial lung disease multidisciplinary team meetings: a global study. ERJ Open Res 2019;5:00209-2018.
35. Wang X, Xia X, Hou Y, et al. Diagnosis of early idiopathic pulmonary fibrosis: current status and future perspective. Respir Res 2025;26:192.
36. Thillai M, Oldham JM, Ruggiero A, et al. Deep Learning-based Segmentation of Computed Tomography Scans Predicts Disease Progression and Mortality in Idiopathic Pulmonary Fibrosis. Am J Respir Crit Care Med 2024;210:465-72.
37. Kim H, Jin KN, Yoo SJ, et al. Deep Learning for Estimating Lung Capacity on Chest Radiographs Predicts Survival in Idiopathic Pulmonary Fibrosis. Radiology 2023;306:e220292.
38. Khalid A, Mushtaq MM, Sattar S, et al. Radiomics-Based Artificial Intelligence and Machine Learning Approach for the Diagnosis and Prognosis of Idiopathic Pulmonary Fibrosis: A Systematic Review. Cureus 2025;17:e87461.
39. Cong L, Chen Y, He X, et al. Diagnosis accuracy of machine learning for idiopathic pulmonary fibrosis: a systematic review and meta-analysis. Eur J Med Res 2025;30:288.
40. Kaul T, Damen JAA, Wynants L, et al. Assessing the quality of prediction models in health care using the Prediction model Risk Of Bias ASsessment Tool (PROBAST): an evaluation of its use and practical application. J Clin Epidemiol 2025;181:111732.
41. Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ 2020;368:m441.

Cite this article as: Yan RN, Hu HY, Ma ZR, Zhao WH, Xu LL, Zang DY, Yang SG, Yu XQ. A systematic review of artificial intelligence-based diagnosis models for idiopathic pulmonary fibrosis. J Thorac Dis 2026;18(5):546. doi: 10.21037/jtd-2026-1-0307
