Identification of disease subtypes associated with prognosis in patients with intermediate-high-risk pulmonary embolism based on hierarchical cluster analysis
Highlight box
Key findings
• Hierarchical clustering discovered disease subtypes related to prognosis by clustering the characteristics of intermediate-high-risk pulmonary embolism (PE) patients at admission.
What is known and what is new?
• There is still controversy regarding thrombolytic therapy for acute intermediate-high-risk PE.
• Applying machine learning-based classification to patients with intermediate-high-risk PE may contribute to better risk stratification.
What is the implication, and what should change now?
• Precision-based stratification of intermediate-high-risk PE emerges as a critical pathway to advance therapeutic decision-making and prognostic outcomes.
Introduction
Background
Pulmonary embolism (PE) is the third most common cardiovascular disease threatening human health, after myocardial infarction and stroke (1). In the past few decades, the incidence of PE has been increasing, with an estimated annual incidence of 39 to 115 cases per 100,000 inhabitants (2). The mortality rate for acute PE ranges from less than 1% to 65%, depending on the clinical presentation (3,4). This wide range suggests that PE is a heterogeneous disease with different clinical characteristics. Prognostic risk assessment and stratification of patients with PE are therefore important for guiding diagnosis and treatment and for reducing mortality (5). The 2019 European Society of Cardiology (ESC) guidelines divide PE into three groups: high risk, intermediate risk, and low risk, with the intermediate-risk group subdivided into intermediate-high-risk and intermediate-low-risk groups (2). However, the current intermediate-high-risk category encompasses heterogeneous subgroups with varied prognoses, which has led to controversy over thrombolysis (6-8). Thus, precision-based stratification of intermediate-high-risk PE emerges as a critical pathway to advance therapeutic decision-making and prognostic outcomes.
Rationale and knowledge gap
In recent years, machine learning techniques have been applied to risk stratification in clinical medicine and have successfully improved treatments (9-11). Unsupervised learning is a subclass of machine learning that previous studies have used to identify intrinsic subpopulations within heterogeneous populations, such as patients with diabetes and obstructive pulmonary disease (11,12). However, whether machine learning can improve outcomes for patients with intermediate-high-risk PE remains to be investigated.
Objective
By performing cluster analysis on intermediate-high-risk PE cases, this research investigated how distinct patient phenotypes correlate with clinical outcomes. We present this article in accordance with the STROBE reporting checklist (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-556/rc).
Methods
Study design and population
This retrospective cohort study enrolled adult patients (≥18 years) with intermediate-high-risk PE, as defined by the 2019 ESC guidelines, who were admitted to two tertiary hospitals (Fujian Provincial Hospital South Hospital from April 2015 to December 2019 and Fujian Provincial Hospital from March 2013 to December 2019). The diagnostic criteria were concurrent right ventricular dysfunction and elevated cardiac biomarkers in the absence of systemic hypotension. Exclusion criteria were: age <18 years or pregnancy; PE classified as high-risk, intermediate-low-risk, or low-risk; and in-hospital death prior to study inclusion. Data for all enrolled patients were extracted from medical records by two independent clinicians (X.W. and S.Z.), and another researcher (X.P.) resolved differences in interpretation between the two primary reviewers. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments, and received ethical approval from Fujian Provincial Hospital Ethics Committee (No. K2023-01-010). Written informed consent was obtained from all participants prior to enrollment.
Data extraction
The following information at admission was collected: age, sex, comorbid chronic lung disease, chronic heart disease and diabetes, Pulmonary Embolism Severity Index (PESI) score, mean arterial pressure, oxygen index, B-type natriuretic peptide (BNP), and D-dimer. Treatment was recorded as thrombolytic therapy or anticoagulant therapy. Clinical outcomes were the length of stay in the intensive care unit (ICU), total length of stay, hospitalization expenses, 1-year PE mortality, 3-year PE mortality and the degree of thrombus absorption at 30 days. Computed tomography pulmonary angiography (CTPA) performed 30 days after treatment was used to determine the extent of thrombus absorption. All patients underwent CTPA, and the images were interpreted by two institutional board-certified radiologists. Thrombus absorption was graded as follows: (I) reduction in thrombus area ≤25%; (II) reduction in thrombus area >25% but ≤50%; (III) reduction in thrombus area >50% but ≤75%; (IV) reduction in thrombus area >75%.
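The four-grade scale can be written as a small lookup. The following Python sketch is illustrative only: the function name and its two area arguments are hypothetical, and how thrombus area was actually measured on CTPA is not specified in this article.

```python
def absorption_grade(area_before: float, area_after: float) -> str:
    """Map the relative reduction in thrombus area to grades I-IV.

    Both arguments are hypothetical inputs (any consistent area unit).
    A thrombus that did not shrink, or grew, falls into grade I.
    """
    if area_before <= 0:
        raise ValueError("baseline thrombus area must be positive")
    reduction = (area_before - area_after) / area_before * 100.0
    if reduction <= 25:
        return "I"    # reduction <=25%
    if reduction <= 50:
        return "II"   # >25% but <=50%
    if reduction <= 75:
        return "III"  # >50% but <=75%
    return "IV"       # >75%


# Example: a thrombus shrinking from 100 to 40 area units (60% reduction)
print(absorption_grade(100, 40))
```

The boundaries are inclusive on the upper side, matching the "≤" in each grade definition.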
Statistical analysis
Data were preprocessed using R software (version 3.6.3). Continuous variables are described as mean ± standard deviation (SD). First, all variables were standardized with a scaling function, and non-normally distributed variables were log-transformed. Second, we performed hierarchical cluster analysis using the average linkage method and estimated the optimal number of clusters. A one-way analysis of variance was subsequently employed to compare demographic profiles, laboratory parameters, and treatment outcomes across the resulting patient clusters. P<0.05 was considered statistically significant.
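The preprocessing and clustering steps can be sketched as follows. This is a minimal standard-library Python illustration, not the authors' R code. The toy values are invented, and three details are assumptions because the article does not state them: Euclidean distance, log-transforming before standardizing, and the handling of categorical variables (omitted here).

```python
import math
from statistics import mean, stdev


def standardize(values):
    """Z-score a column using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def average_linkage(points, n_clusters):
    """Agglomerative clustering with average linkage.

    Starts with one cluster per observation and repeatedly merges the two
    clusters with the smallest mean pairwise Euclidean distance until
    n_clusters remain. Returns lists of observation indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        total = sum(math.dist(points[a], points[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data for six patients (not from the study)
ages = [60, 62, 61, 80, 82, 81]
bnp = [500, 550, 520, 4000, 4200, 4100]
log_bnp = [math.log(v) for v in bnp]          # log scale for a skewed variable
points = list(zip(standardize(ages), standardize(log_bnp)))

clusters = average_linkage(points, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Stopping the merging at a chosen number of clusters is equivalent to cutting the dendrogram at the corresponding height.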
Results
Study population
In this cohort study, 1,085 patients diagnosed with PE were screened. Of 81 patients diagnosed with acute intermediate-high-risk PE, 2 patients who died in the hospital were excluded. A total of 79 subjects meeting the inclusion criteria were ultimately enrolled in this analysis. Figure 1 shows a flowchart of the participant selection procedure.
Clustering and the comparison of patients in 3 clusters
Hierarchical clustering analysis was conducted on the 79 enrolled patients utilizing the 10 predefined variables, with resultant dendrogram visualization presented in Figure 2. The dendrogram’s x-axis represents observation units, while the y-axis corresponds to linkage distances, reflecting the clustering objective of minimizing intra-group variance while maximizing inter-group separation. Optimal stratification was achieved by pruning the dendrogram at a height yielding three clinically distinct clusters (cluster 1: n=6; cluster 2: n=67; cluster 3: n=6). Comparative analyses of baseline characteristics and laboratory profiles across clusters are systematically detailed in Table 1.
Table 1
| Basic information | Cluster 1 (N=6) | Cluster 2 (N=67) | Cluster 3 (N=6) | P overall |
|---|---|---|---|---|
| Therapy | | | | 0.21 |
| Thrombolytic therapy | 1 (16.7) | 25 (37.3) | 4 (66.7) | |
| Anticoagulant therapy | 5 (83.3) | 42 (62.7) | 2 (33.3) | |
| Age (years) | 72.0±8.02 | 63.9±11.6 | 74.3±13.1 | 0.04 |
| Sex | | | | >0.99 |
| Male | 3 (50.0) | 38 (56.7) | 3 (50.0) | |
| Female | 3 (50.0) | 29 (43.3) | 3 (50.0) | |
| Chronic lung disease | | | | <0.001 |
| No | 0 | 67 (100.0) | 6 (100.0) | |
| Yes | 6 (100.0) | 0 | 0 | |
| Chronic heart disease | | | | 0.003 |
| No | 3 (50.0) | 45 (67.2) | 0 | |
| Yes | 3 (50.0) | 22 (32.8) | 6 (100.0) | |
| Diabetes | | | | <0.001 |
| No | 5 (83.3) | 67 (100.0) | 0 | |
| Yes | 1 (16.7) | 0 | 6 (100.0) | |
| PESI score | 117±23.4 | 103±29.0 | 114±24.4 | 0.39 |
| MBP (mmHg) | 90.8±13.0 | 92.7±14.8 | 87.6±19.5 | 0.71 |
| Oxygen index | 194±70.4 | 210±80.3 | 240±53.4 | 0.57 |
| BNP (pg/mL) | 3,051±1,800 | 3,403±2,393 | 4,628±3,314 | 0.46 |
| D-dimer (mg/L) | 10.4±8.64 | 10.6±7.99 | 14.0±12.3 | 0.63 |
Data are presented as mean ± standard deviation or n (%). BNP, B-type natriuretic peptide; MBP, mean blood pressure; PESI, Pulmonary Embolism Severity Index.
The one-way analysis of variance showed statistically significant differences among clusters in age, chronic lung disease, chronic heart disease and diabetes (P<0.05). Differences among clusters in sex, PESI score, mean arterial pressure, oxygen index, BNP, D-dimer and therapy were not statistically significant (P>0.05) (Table 1). Mean age was higher in clusters 1 and 3 than in cluster 2. In cluster 1, all patients had chronic lung disease and one had diabetes. Cluster 2 was characterized by the absence of chronic lung disease and diabetes in all patients, and about one-third of patients had chronic heart disease. In cluster 3, all patients had chronic heart disease and diabetes, and none had chronic lung disease.
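As an illustration of the comparison for age, a one-way analysis of variance can be approximated from the summary statistics in Table 1 alone. The Python sketch below is not the authors' R analysis and works from rounded means and SDs, so it can only roughly reproduce the reported value. The closed-form tail probability used here holds only when the numerator degrees of freedom equal 2, that is, for exactly three groups.

```python
def anova_from_summary(groups):
    """One-way ANOVA F statistic from (n, mean, sd) tuples, one per group."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def f_upper_tail_df1_2(f_stat, df_within):
    """P(F > f_stat) for an F distribution with 2 numerator df.

    For 2 numerator df the upper tail reduces to (1 + 2f/d2)^(-d2/2).
    """
    return (1 + 2 * f_stat / df_within) ** (-df_within / 2)


# Age (years) by cluster, taken from Table 1: (n, mean, SD)
age_groups = [(6, 72.0, 8.02), (67, 63.9, 11.6), (6, 74.3, 13.1)]
f_stat, df_b, df_w = anova_from_summary(age_groups)
p_value = f_upper_tail_df1_2(f_stat, df_w)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}, P = {p_value:.3f}")
```

Under these assumptions the result is close to the P=0.04 reported for age in Table 1.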
Prognosis of the patients in each cluster
Table 2 reports the comparison of clinical outcomes among clusters. One-way analysis of variance showed significant overall differences in the length of stay in the ICU and in hospitalization expenses (P=0.03 and P<0.001, respectively), whereas the overall difference in total length of stay was of borderline significance (P=0.053). The Chi-squared test showed significant differences in 1- and 3-year PE mortality (P<0.001 and P=0.003, respectively), as did Fisher's exact test (P=0.005 and P=0.03, respectively). The length of stay in the ICU was significantly longer in cluster 3 (10.5±11.5 days) than in cluster 1 (0 days; P=0.02). The total length of stay tended to be longer in cluster 3 (20.0±10.5 days) than in cluster 1 (11.5±3.99 days) and cluster 2 (14.4±5.91 days), although neither difference reached the 0.05 threshold (P=0.052 and P=0.09 for cluster 3 vs. 1 and cluster 3 vs. 2, respectively). Hospitalization expenses were significantly higher in cluster 3 (64,251±28,270 yuan) than in cluster 1 (17,940±6,058 yuan) and cluster 2 (29,022±18,202 yuan; P=0.003 and P=0.002 for cluster 3 vs. 1 and cluster 3 vs. 2, respectively). CTPA-confirmed thrombus absorption was comparable among the three clusters (P=0.95).
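The Chi-squared comparison of mortality can be checked from the survival and death counts reported for each cluster in Table 2. The Python sketch below assumes a Pearson Chi-squared test without continuity correction, which this article does not state explicitly. It relies on the fact that, for 2 degrees of freedom, the upper tail of the Chi-squared distribution is exp(-x/2).

```python
import math


def pearson_chi_squared(table):
    """Pearson Chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_upper_tail_df2(stat):
    """P(X > stat) for a Chi-squared distribution with exactly 2 df."""
    return math.exp(-stat / 2)


# Rows: clusters 1-3; columns: (survived, died), counts from Table 2
one_year = [[6, 0], [65, 2], [3, 3]]
three_year = [[6, 0], [62, 5], [3, 3]]

for label, table in (("1-year", one_year), ("3-year", three_year)):
    stat, df = pearson_chi_squared(table)
    print(label, f"chi2 = {stat:.2f}, df = {df}, P = {chi2_upper_tail_df2(stat):.4f}")
```

Several expected counts in these tables fall below 5, a situation in which an exact test is commonly preferred over the Chi-squared approximation.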
Table 2
| Prognosis | Cluster 1 (N=6) | Cluster 2 (N=67) | Cluster 3 (N=6) | P (cluster 3 vs. 1) | P (cluster 3 vs. 2) | P overall | P (Chi-squared test) | P (Fisher’s exact) |
|---|---|---|---|---|---|---|---|---|
| ICU stay (days) | 0.00±0.00 | 4.87±6.40 | 10.5±11.5 | 0.02** | 0.13 | 0.03** | | |
| Hospital stay (days) | 11.5±3.99 | 14.4±5.91 | 20.0±10.5 | 0.052* | 0.09* | 0.053* | | |
| Total expenses (yuan) | 17,940±6,058 | 29,022±18,202 | 64,251±38,270 | 0.003*** | 0.002*** | <0.001*** | | |
| Outcome† | | | | | | 0.95 | | |
| I | 0 | 2 (2.99) | 0 | | | | | |
| II | 1 (16.7) | 16 (23.9) | 2 (33.3) | | | | | |
| III | 3 (50.0) | 23 (34.3) | 2 (33.3) | | | | | |
| IV | 2 (33.3) | 26 (38.8) | 2 (33.3) | | | | | |
| 1-year PE mortality | | | | | | | <0.001*** | 0.005*** |
| Survival | 6 (100.0) | 65 (97.0) | 3 (50.0) | | | | | |
| Dead | 0 | 2 (2.99) | 3 (50.0) | | | | | |
| 3-year PE mortality | | | | | | | 0.003*** | 0.03*** |
| Survival | 6 (100.0) | 62 (92.5) | 3 (50.0) | | | | | |
| Dead | 0 | 5 (7.46) | 3 (50.0) | | | | | |
Data are presented as mean ± standard deviation or n (%). †, CTPA-confirmed absorption at 30 days. *, P<0.1; **, P<0.05; ***, P<0.01. CTPA, computed tomography pulmonary angiography; ICU, intensive care unit; PE, pulmonary embolism.
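For readers who wish to check the exact-test column, the following Python sketch computes an exact P value for a 3×2 table by enumerating every table with the same margins. It assumes the Freeman-Halton extension of Fisher's exact test, which sums the probabilities of all tables no more likely than the observed one. The article does not state which implementation was used, so this is an illustration rather than the authors' procedure.

```python
from itertools import product
from math import comb


def exact_test_rx2(rows):
    """Exact P value for an r x 2 table given as [(survived, died), ...]."""
    row_totals = [a + b for a, b in rows]
    deaths_total = sum(b for _, b in rows)

    def weight(deaths):
        # Unnormalized multivariate hypergeometric probability (an integer)
        w = 1
        for n, d in zip(row_totals, deaths):
            w *= comb(n, d)
        return w

    observed = weight([b for _, b in rows])
    all_tables = comb(sum(row_totals), deaths_total)

    as_or_less_likely = 0
    ranges = (range(min(n, deaths_total) + 1) for n in row_totals)
    for deaths in product(*ranges):
        if sum(deaths) == deaths_total:
            w = weight(deaths)
            if w <= observed:  # integer comparison avoids rounding ties
                as_or_less_likely += w
    return as_or_less_likely / all_tables


# (survived, died) per cluster, counts from Table 2
one_year = [(6, 0), (65, 2), (3, 3)]
three_year = [(6, 0), (62, 5), (3, 3)]
print(f"1-year: P = {exact_test_rx2(one_year):.4f}")
print(f"3-year: P = {exact_test_rx2(three_year):.4f}")
```

Working with integer weights keeps the comparison against the observed table exact, which matters here because mirrored tables for clusters 1 and 3 have identical probabilities.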
Discussion
Key findings
This analysis applied agglomerative hierarchical clustering to a cohort of 79 intermediate-high-risk PE patients, each characterized by 10 standardized admission variables. The algorithm delineated three phenotypically distinct subgroups: cluster 1 (n=6), cluster 2 (n=67), and cluster 3 (n=6). Clustering the characteristics of intermediate-high-risk PE patients at admission thus identified disease subtypes related to prognosis. Compared with cluster 1, the length of stay in the ICU, total length of stay and hospitalization expenses in cluster 3 were higher. The total length of stay, hospitalization expenses, and 1- and 3-year PE mortality in cluster 3 were higher than in cluster 2. Cluster 3 was characterized by advanced age, chronic heart disease and diabetes. These findings suggest that advanced age, diabetes and chronic heart disease are risk factors for poor prognosis in acute intermediate-high-risk PE.
Strengths and limitations
To our knowledge, this is the first application of hierarchical clustering analysis to intermediate-high-risk PE. Nevertheless, several limitations warrant consideration: (I) the retrospective design and limited sample size (n=79) may constrain generalizability, and the long-term prognostic implications remain undetermined; (II) the absence of an external validation cohort precludes confirmation of cluster stability and clinical reproducibility.
Comparison with similar research and explanations of findings
Studies have suggested that advanced age is associated with poor prognosis of PE and is an independent risk factor for 3-month mortality of PE (13,14), which is consistent with our findings. Studies have shown that diabetic hyperglycemia can lead to an imbalance between the fibrinolytic and coagulation systems, and that the resulting hypercoagulable state predisposes to deep venous thrombosis of the lower limbs and PE (15). A Spanish population study found that the incidence of PE was higher in both men and women with type 2 diabetes than in those without diabetes, and that type 2 diabetes was an independent risk factor for in-hospital mortality from PE (16). In this study, all patients in cluster 3 had diabetes, and their prognosis was worse than that of patients without diabetes, which is consistent with previous studies. Our study also showed that patients with chronic heart disease had a poor prognosis. A long-term follow-up study showed that the all-cause fatality rate of PE patients with cardiovascular disease risk factors was 2.2 times higher than that of patients without cardiovascular disease (4). A recent case-control study suggested that hypertension is strongly associated with PE (17). Studies have shown that co-existing chronic heart failure (HF) is an independent risk factor for PE. Three interconnected pathological processes drive thromboembolic complications in HF: impaired cardiac output causing venous stasis, a prothrombotic state potentiated by hemoconcentration and platelet hyperreactivity, and endothelial dysfunction disrupting the thromboresistant interface (14,18,19). A recent study showed that a VB-Net deep learning (DL) model based on CTPA could conveniently and efficiently detect and quantitatively evaluate PE (20). In our study, however, CTPA-confirmed thrombus absorption was comparable among the three clusters, a finding that requires verification in studies with larger sample sizes.
The recognized independent risk factors for PE include deep vein thrombosis, major orthopedic surgery, malignant tumor, and pregnancy, among others. However, other concomitant diseases, especially chronic diseases such as diabetes mellitus and coronary heart disease, have received less attention and research. Acute PE produces a wide range of clinical severity. Comorbidity is the most important prognostic factor for short-term mortality in patients with hemodynamically stable acute PE (21). The results of this study suggest that, in clinical practice, patients with PE who are elderly or who have diabetes mellitus or chronic heart disease are more in need of “multi-pronged” comprehensive treatment, and that effective prevention is necessary for people at high risk of PE.
Clustering is an unsupervised, exploratory approach, and the choice of clustering method critically affects the results. We chose hierarchical clustering over alternative algorithms for two reasons: it can handle limited sample sizes (n=79), and it provides clinically interpretable stratification. The dendrogram shows how the classification is built up and may reveal pathophysiological commonalities within clusters, a particularly valuable feature in the era of multidimensional electronic health records. Compared with conventional statistical paradigms, the capacity of hierarchical clustering to uncover latent patterns in high-dimensional clinical datasets holds promise for advancing precision medicine (22). Several recent studies have applied cluster analysis to re-stratify established disease definitions, such as traumatic brain injury (23) and hypertension (24). Another study used hierarchical clustering analysis to predict 1-year mortality after starting hemodialysis (25). In the present study, a stricter stratification of intermediate-high-risk PE patients by clinical features identified disease subtypes related to prognosis, which could help target early treatment to the patients who would benefit most from it and thereby support a more precise approach to care.
Implications and actions needed
In this study, hierarchical clustering identified a subgroup with a poor prognosis among intermediate-high-risk PE patients at admission. Prospective studies with larger sample sizes are needed to confirm these results, and the clusters should be validated on an external dataset.
Conclusions
In conclusion, hierarchical clustering discovered a disease subtype related to prognosis by clustering the characteristics of intermediate-high-risk PE patients at admission. Applying machine learning-based classification to patients with intermediate-high-risk PE may contribute to better risk stratification.
Acknowledgments
The authors would like to thank the participants for providing the information used in this study and for kindly making arrangements for the data collection.
Footnote
Reporting Checklist: The authors have completed the STROBE reporting checklist. Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-556/rc
Data Sharing Statement: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-556/dss
Peer Review File: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-556/prf
Funding: This work was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-556/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments, and received ethical approval from Fujian Provincial Hospital Ethics Committee (No. K2023-01-010). Written informed consent was obtained from all participants prior to enrollment.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
1. Wendelboe AM, Raskob GE. Global Burden of Thrombosis: Epidemiologic Aspects. Circ Res 2016;118:1340-7.
2. Konstantinides SV, Meyer G, Becattini C, et al. 2019 ESC Guidelines for the diagnosis and management of acute pulmonary embolism developed in collaboration with the European Respiratory Society (ERS). Eur Heart J 2020;41:543-603.
3. Kasper W, Konstantinides S, Geibel A, et al. Management strategies and determinants of outcome in acute major pulmonary embolism: results of a multicenter registry. J Am Coll Cardiol 1997;30:1165-71.
4. Ng AC, Chung T, Yong AS, et al. Long-term cardiovascular and noncardiovascular mortality of 1023 patients with confirmed acute pulmonary embolism. Circ Cardiovasc Qual Outcomes 2011;4:122-8.
5. Hepburn-Brown M, Darvall J, Hammerschlag G. Acute pulmonary embolism: a concise review of diagnosis and management. Intern Med J 2019;49:15-27.
6. Zhao T, Ni J, Hu X, et al. The Efficacy and Safety of Intermittent Low-Dose Urokinase Thrombolysis for the Treatment of Senile Acute Intermediate-High-Risk Pulmonary Embolism: A Pilot Trial. Clin Appl Thromb Hemost 2018;24:1067-72.
7. Bhamani A, Pepke-Zaba J, Sheares K. Lifting the fog in intermediate-risk (submassive) PE: full dose, low dose, or no thrombolysis? F1000Res 2019;8:F1000 Faculty Rev-330.
8. Wang L, Yu C, Hu K, et al. Research progress in interventional therapy for acute intermediate-high-risk and high-risk pulmonary embolism. J Thorac Dis 2024;16:7958-77.
9. Shah SJ, Katz DH, Selvaraj S, et al. Phenomapping for novel classification of heart failure with preserved ejection fraction. Circulation 2015;131:269-79.
10. Komorowski M, Celi LA, Badawi O, et al. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med 2018;24:1716-20.
11. Hirai K, Shirai T, Suzuki M, et al. A clustering approach to identify and characterize the asthma and chronic obstructive pulmonary disease overlap phenotype. Clin Exp Allergy 2017;47:1374-82.
12. Ahlqvist E, Storm P, Käräjämäki A, et al. Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol 2018;6:361-9.
13. Laporte S, Mismetti P, Décousus H, et al. Clinical predictors for fatal pulmonary embolism in 15,520 patients with venous thromboembolism: findings from the Registro Informatizado de la Enfermedad TromboEmbolica venosa (RIETE) Registry. Circulation 2008;117:1711-6.
14. McHugh KB, Visani L, DeRosa M, et al. Gender comparisons in pulmonary embolism (results from the International Cooperative Pulmonary Embolism Registry [ICOPER]). Am J Cardiol 2002;89:616-9.
15. Schmitt VH, Hobohm L, Münzel T, et al. Impact of diabetes mellitus on mortality rates and outcomes in myocardial infarction. Diabetes Metab 2021;47:101211.
16. Jiménez-García R, Albaladejo-Vicente R, Hernandez-Barrera V, et al. Type 2 Diabetes Is a Risk Factor for Suffering and for in-Hospital Mortality with Pulmonary Embolism. A Population-Based Study in Spain (2016-2018). Int J Environ Res Public Health 2020;17:8347.
17. Yang G, Nie S. Risk factors for pulmonary embolism: a case-control study. J Thorac Dis 2025;17:1552-60.
18. Lip GY, Gibbs CR. Does heart failure confer a hypercoagulable state? Virchow's triad revisited. J Am Coll Cardiol 1999;33:1424-6.
19. Fanola CL, Norby FL, Shah AM, et al. Incident Heart Failure and Long-Term Risk for Venous Thromboembolism. J Am Coll Cardiol 2020;75:148-58.
20. Qiao Y, Gao Y, Chen Y, et al. Quantitative assessment and risk stratification of random acute pulmonary embolism cases using a deep learning model based on computed tomography pulmonary angiography images. Quant Imaging Med Surg 2025;15:1950-62.
21. Penaloza A, Roy PM, Kline J. Risk stratification and treatment strategy of pulmonary embolism. Curr Opin Crit Care 2012;18:318-25.
22. Shah P, Kendall F, Khozin S, et al. Artificial intelligence and machine learning in clinical development: a translational perspective. NPJ Digit Med 2019;2:69.
23. Åkerlund CAI, Holst A, Stocchetti N, et al. Clustering identifies endotypes of traumatic brain injury in an intensive care cohort: a CENTER-TBI study. Crit Care 2022;26:228.
24. Vaura FC, Salomaa VV, Kantola IM, et al. Unsupervised hierarchical clustering identifies a metabolically challenged subgroup of hypertensive individuals. J Clin Hypertens (Greenwich) 2020;22:1546-53.
25. Komaru Y, Yoshida T, Hamasaki Y, et al. Hierarchical Clustering Analysis for Predicting 1-Year Mortality After Starting Hemodialysis. Kidney Int Rep 2020;5:1188-95.


