Tumor prognostic risk stratification based on pseudo-time analysis of single-cell sequencing for patients with lung adenocarcinoma
Original Article

Tumor prognostic risk stratification based on pseudo-time analysis of single-cell sequencing for patients with lung adenocarcinoma

Huanle Jin1,2#, Huandi Jin1,2#, Ting Wu1,2, Keling Huang1,2, Jiangnan Lin1,2 ORCID logo, Linyu Wu1,2 ORCID logo, Chen Gao1,2 ORCID logo

1Department of Radiology, The First Affiliated Hospital of Zhejiang Chinese Medical University (Zhejiang Provincial Hospital of Chinese Medicine), Hangzhou, China; 2The First School of Clinical Medicine, Zhejiang Chinese Medical University, Hangzhou, China

Contributions: (I) Conception and design: C Gao, L Wu, J Lin; (II) Administrative support: C Gao, L Wu, J Lin; (III) Provision of study materials or patients: H Jin, H Jin; (IV) Collection and assembly of data: T Wu, K Huang; (V) Data analysis and interpretation: H Jin, H Jin, C Gao; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Chen Gao, MM; Linyu Wu, MD; Jiangnan Lin, MM. Department of Radiology, The First Affiliated Hospital of Zhejiang Chinese Medical University (Zhejiang Provincial Hospital of Chinese Medicine), 54 Youdian Road, Hangzhou 310016, China; The First School of Clinical Medicine, Zhejiang Chinese Medical University, Hangzhou, China. Email: doctor_gaochen@zcmu.edu.cn; wulinyu@zcmu.edu.cn; linjiangnan@zcmu.edu.cn.

Background: Accurate prognostic risk stratification for lung adenocarcinoma (LUAD) remains a critical challenge. This study employs single-cell pseudo-time series analysis to identify pseudo-time differential genes (PTDGs) associated with LUAD development and constructs tumor progression-related prognostic risk stratification for patients with LUAD.

Methods: This study conducted quality control for single-cell RNA sequencing (RNA-seq) data firstly. Then, malignant tumor cells of the epithelium were identified using cell-cluster markers and InferCNV. Trajectory analysis with Monocle identified PTDGs, further refined to 2,597 candidate genes. The prognostic signature of PTDGs was constructed by using Cox, least absolute shrinkage and selection operator (LASSO) and random forest (RF) analyses. In addition, multicenter datasets and immunohistochemistry analysis were used to evaluate the expression of the identified PTDGs between the two groups.

Results: The developmental trajectory of the LUAD epithelium was plotted, and 13 PTDGs were identified to be highly associated with the prognosis of LUAD. Based on the prognostic risk score, The Cancer Genome Atlas (TCGA)-LUAD patients were well stratified into low- and high-risk groups. The PTDGs-based risk model demonstrated good performance in the training, internal and external validation datasets, as well as being compared with existing risk score formulas. In addition, multi-center datasets and immunohistochemistry revealed a significant up-regulation of DDIT4, FURIN, PTTG1 and RIPK2 at transcriptional and protein expression levels in LUAD tissues compared to normal tissues.

Conclusions: PTDGs are potential biological markers for prognostic risk stratification of patients with LUAD.

Keywords: Pseudo-time analysis; single-cell RNA sequencing (single-cell RNA-seq); risk stratification; lung adenocarcinoma (LUAD)


Submitted Aug 02, 2025. Accepted for publication Nov 05, 2025. Published online Dec 29, 2025.

doi: 10.21037/jtd-2025-1581


Highlight box

Key findings

• This study integrated single-cell RNA sequencing (RNA-seq) with bulk RNA transcriptome and constructed an accurate prognostic risk stratification for patients with lung adenocarcinoma (LUAD) based on a single-cell, pseudo-time series analysis.

• The identified pseudo-time differential genes (PTDGs) were comprehensively validated across multiple dimensions, highlighting their critical roles in LUAD through clinical prognosis, transcriptional profiles, and protein expression levels.

What is known and what is new?

• Most biomarkers are identified through static comparisons of transcriptomic differences between normal and tumor tissues; few studies have considered the dynamic gene expression during tumor progression.

• This study employs single-cell pseudotime analysis to explore the dynamic gene expression patterns underlying LUAD progression and has identified PTDGs associated with LUAD development.

What is the implication, and what should change now?

• The tumor biomarkers identified through pseudotime analysis might be potential predictive biomarkers for LUAD patients.


Introduction

Lung cancer is the second most widespread and deadliest form of cancer globally (1). In 2025, it was projected that there would be an estimated 226,650 new cases of lung cancer and over 124,730 deaths in the United States alone (1). In pathological classification, about 85% of lung cancer cases are categorized as non-small cell lung cancer (NSCLC), among which lung adenocarcinoma (LUAD) is one of the most common subtypes (2). Although approaches such as surgery, targeted therapy, and others have improved outcomes in patients with LUAD, significant challenges remain. This involves identifying novel driver gene alterations to broaden the scope of individuals benefiting from targeted therapy. As such, there is a great need to discover dependable biological markers for predicting the prognosis of patients with LUAD.

In recent years, many studies have explored potential prognostic markers for LUAD using traditional RNA sequencing (RNA-seq) data, contributing to the knowledge of tumorigenesis and tumor progression (3,4). Based on the average gene expression of multiple cells, traditional RNA-seq can provide an overview of gene expression profiles. However, this approach fails to capture the intratumor heterogeneity by which different cell types and subpopulations may exhibit distinct gene expression profiles (5). Therefore, the predictive characteristics relied on traditional transcriptomes are inevitably limited (6).

Single-cell RNA-seq is one approach to dissect the heterogeneity of complex biological systems (7,8). This approach offers researchers a novel avenue for studying lung cancer, facilitating a comprehensive understanding of the mechanisms involved in its progression (9,10). In addition, this approach helps identify tumor-associated cells and specific biomarkers, which can be utilized for predictive assessment and treatment response prediction in patients with lung cancer (11,12). Moreover, previous single-cell RNA studies have primarily focused on differentially expressed genes (DEGs) in normal and tumor cells, lacking mechanistic investigations into the time sequence of the transition from normal to tumor cells (8,13). Pseudotime analysis (14) can utilize known start and end points of tags to establish time series in genomic data. This allows for a more comprehensive understanding of cellular processes such as metabolism, cell cycle regulation (15,16), and disease progression, including cancer development (17).

Therefore, this study aims to employ pseudo-time series analysis methods to identify pseudo-time differential genes (PTDGs) involved in the developmental trajectories of LUAD. Based on the identified PTDGs, risk models were constructed to predict the prognosis of patients with LUAD. Additionally, this approach may provide the potential basis for identifying therapeutic targets and drug candidates. We present this article in accordance with the TRIPOD reporting checklist (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1581/rc).


Methods

Data collection and collation

From the Gene Expression Omnibus (GEO) database official website (https://www.ncbi.nlm.nih.gov/geo) and UCSC Xena official website (https://xenabrowser.net), transcriptome data and patient clinical data of the GSE149655 and The Cancer Genome Atlas (TCGA) database were obtained. The TCGA LUAD database had 877 samples from 585 patients, of which 295 were normal/adjacent samples, and 582 were tumor samples. The GSE149655 was integrated into four samples to construct single-cell analysis objects. The differential expression of PTDGs among LUAD patients was evaluated over several cohorts, including TCGA, GSE115002, and GSE116959. And the PTDGs-based risk score model was externally validated in two independent GEO cohorts: GSE30219 (n=85) and GSE37745 (n=106). Then, the protein levels of PTDGs were subsequently validated via the Human Protein Atlas (HPA) database and further confirmed using the Clinical Proteomic Tumor Analysis Consortium (CPTAC)-LUAD proteomic dataset. The flow chart of this study is shown in Figure 1. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Figure 1 Flowchart describing the overall process of this data analysis. GEO, Gene Expression Omnibus; HPA, Human Protein Atlas; TCGA, The Cancer Genome Atlas.

Quality control, cell-type clustering, and major cell-type identification

Initial cell filtering is based on the following criteria: (I) each cell must express at least 500 genes; (II) mitochondrial cutoff values were set at 15%. These measures contribute to quality control and ensure accurate interpretation of the data (18,19). Finally, 4,984 cells were obtained for the downstream analysis. This study utilized Harmony as the method for batch effect removal.

Seurat 4.3.0 was first used to normalize expression matrices by the NormalizeData and ScaleData functions. Then, the FindVariable function was applied to select the top 2,000 variable genes and perform principal component analysis (PCA). The first 15 principal components and a resolution of 0.8 were used with the FindClusters function to generate 16 cell clusters. To assign one of the three major cell types to each cluster, we scored each cluster by the normalized expressions of the following canonical markers: endothelial cells (PECAM1), epithelial cells (EPCAM, SFTPD, PDPN, SFTPB, FOXJ1, SCGB1A1), and fibroblasts (COL1A1). The highest-scored cell type was assigned to each cluster. The clusters assigned to the same cell type were lumped together for analysis. The results were manually examined to ensure their correctness and visualized by the uniform manifold approximation and projection (UMAP).

Epithelial subset analysis

Cells previously annotated as epithelial (n=3,325) were grouped into a subset and reclustered using methods described by the following parameters: Ngenes =(2,000), Npcs =15, Res =0.1. We inferred the copy number alterations (CNAs) of 2 patients by InferCNV to identify malignant epithelial cells using single-cell transcriptomic profiles (20). All epithelial cells, as well as 100 fibroblasts and 300 endothelial cells, were used as the input. An additional 300 fibroblasts and 500 endothelial cells were used as reference controls. The cutoff parameters were applied at 0.1. Using the method described by Peng et al. (21), each cell was scored for the strength of the copy number variation (CNV) signal while defining cancer cells as those with significantly higher scores than others and a significant expression of EPCAM.

Noncancerous epithelial cells (n=2,742) were determined from InferCNV analysis as cells lacking large chromosomal aberrations. Using representative markers for classical airway epithelial cell types, we identified alveolar Type 2 cells (SFTPD, SFTPB), alveolar Type 1 cells (AGER, CAV1, PDPN), club cells (SCGB1A1) and ciliated cells (FOXJ1).

Epithelial cell trajectory analysis

We utilized the Monocle (version 2) algorithm, using pseudo-time-related genes selected by the DifferentialGeneTest function, as the input to identify the differential cancer cell states compared to normal epithelial cells. DDRTree was employed to learn tree-like trajectories. Following dimension reduction and cell ordering, the epithelial cell trajectory was inferred using the default parameters of Monocle.

Selecting prognostic genes

The inclusion and exclusion criteria for patients were then determined. The inclusion criteria for TCGA data were as follows: (I) histological diagnosis of LUAD; (II) each patient had accessible RNA profile data; and (III) each patient had available clinical data. The exclusion criteria for TCGA data were: (I) a lack of relevant survival data; or (II) the survival time/follow-up time was less than 30 days. After screening, the selected 470 patients who met the criteria from the TCGA data set were sent for further analysis (available online: https://cdn.amegroups.cn/static/public/jtd-2025-1581-1.xlsx). Then, pseudotime-related genes were intersected with the TCGA genes, further dividing the patients in the TCGA data set at a 7:3 randomization into a training set of 330 and an internal validation set of 140. Based on the data of the training set, the genes were selected by stepwise Cox regression analysis, and P<0.05 was considered as genes with a significantly different expression.

Construction of risk model

The least absolute shrinkage and selection operator (LASSO) regression and random forest (RF) methods were employed to select prognosis-related genes in LUAD patients comprehensively. A multivariate stepwise Cox regression model was used to determine the optimal set of DEGs. Based on the above results, the risk model for LUAD was constructed, and the risk score of each sample was obtained (available online: https://cdn.amegroups.cn/static/public/jtd-2025-1581-2.xlsx, https://cdn.amegroups.cn/static/public/jtd-2025-1581-3.xlsx). Subsequently, patients were stratified into high-risk and low-risk groups based on the median risk score. The internal validation set was used to assess the efficacy of the risk model. Visualization of the prediction performance in both the training and validation sets was accomplished through time-dependent receiver operating characteristic (ROC) and calibration curves. Survival durations for the training and validation sets were evaluated using the Kaplan-Meier method. Additionally, we compared the model performance with the existing article’s risk score calculation formula (22,23).

Establishment of combination model and nomogram

To evaluate the predictive value of the risk score with other clinical variables, univariate and multivariate Cox analysis for overall survival (OS) were performed, including the risk score and the conventional clinicopathologic variables [age, gender, tumor-node-metastasis (TNM) stage, smoking history, radiation and chemotherapy treatment]. Potential multicollinearity between TNM stage and genes was assessed using variance inflation factor (VIF) and Pearson correlation analyses. A nomogram was constructed to predict the 1-, 3- and 5-year OS of patients with LUAD by combining the risk score and clinical characteristics. The time-dependent ROC curve analysis and calibration plot evaluated the performance of the nomogram.

Multicenter dataset and immunohistochemical staining validation of identified PTDGs

The gene expression patterns of PTDGs were investigated using TCGA-LUAD and two independent GEO datasets, including GSE115002 and GSE116959. And to address generalizability, the PTDGs-based risk score model was externally validated in two additional GEO datasets, GSE30219 (n=85) and GSE37745 (n=106), using Kaplan-Meier survival curves and time-dependent ROC analyses to confirm prognostic performance. Additionally, the protein expression levels for PTDGs were examined through the Human Protein Atlas (HPA) database (https://www.proteinatlas.org/). To further validate these findings, the Clinical Proteomic Tumor Analysis Consortium (CPTAC) LUAD proteomic dataset from the University of Alabama at Birmingham CANcer data analysis Portal (UALCAN) was employed.

Quantification and statistical analysis

All statistical data were analyzed using R statistical software (4.2.2). The R packages used mainly include “Seurat”, “Monocle”, “infercnv”, “randomForestSRC”, “survival” and “timeROC”. Among these, the Seurat package was used for single-cell data analysis. The Monocle package (version 2) was utilized for pseudo-time series analysis of single-cell data, with a significance threshold set at q value <0.01. RF analysis was performed using the randomForestSRC package. The survival package was used to conduct the univariate and multivariate Cox analyses, and the OS difference between high- and low-risk patients was assessed via Kaplan-Meier analysis and log-rank test. LASSO regression for differential gene selection was carried out using the glmnet package. Nomogram and calibration plots were generated with the rms package, and the time-dependent ROC curve analysis was conducted with the timeROC package of R software. A significance level of P<0.05 was considered statistically significant.


Results

Single-cell RNA-seq analysis of LUAD

Gene-expression profiles of 4,984 cells were retained after quality control filtering. Following gene-expression normalization, this group performed PCA and clustered cells using graph-based clustering on the informative PCA space (n=15) (Figure 2A). Then, the presented heat maps of the first six principal components were constructed (Figure 2B). Accounting for the batch effect of the data, the Harmony method for batch correction of the combined data was used (Figure 2C). The FindClusters function was used to generate 16 cell clusters (Figure 2D). Three major cell types, classified as epithelial cells, endothelial cells, and fibroblasts, were detected by characteristic canonical cell markers (Figures 2E,2F,3A). Epithelial cells (n=3,325) were categorized into a subset and clustered into seven discrete epithelial clusters (Figure 3B).

Figure 2 PCA dimensionality reduction and annotation of the primary cell population. (A) JackStraw was used to compare each PC’s P value and uniform distribution (dashed line). The “significant” PC will show strong enrichment of features with low P values (solid line above the dashed line). (B) Heatmap of the first six PCs; (C) dimensionality reduction plots of four samples after the Harmony method, showing no batch effect; (D) after using the FindClusters function, the data was divided into 16 clusters; (E) the marker gene violin expression plots for each cluster; (F) expression levels of representative markers for epithelium, endothelial and fibroblasts are plotted onto the UMAP map. The color key from gray to blue indicates relative expression levels from low to high. PC, principal component; PCA, principal component analysis; UMAP, uniform manifold approximation and projection.
Figure 3 Further annotation after epithelial reclustering and clustering-based copy-number variation resolves cancer from epithelial cells. (A) Annotation of 3 major cell types; (B) reclustering of epithelial cells into 7 clusters; (C) heatmap of CNA profiles inferred from single-cell RNA-seq of epithelial cells of patients. Red indicated genomic amplifications, and blue indicated genomic deletions. The x-axis showed all chromosomes in the numerical order. The y-axis was marked by each cluster and reference clusters. (D) Raincloud plots of the CNV score for each cluster. Clusters 2 and 5 were defined as cancer cells and showed higher CNV scores than others in the raincloud plots, indicating a potential association with LUAD. Clusters 0, 1, 3, 4, 6 were defined as non-malignant cells. (E) marker gene violin expression plots for each epithelial cluster; (F) expression levels of representative markers for epithelium are plotted onto the UMAP map. (G) The UMAP plot demonstrated the main cell types in the epithelium. (H) Clear differences in epithelial cell composition between normal and tumor tissues are shown. CNA, copy number alteration; CNV, copy number variation; LUAD, lung adenocarcinoma; UMAP, uniform manifold approximation and projection.

Clustering-based copy-number variation resolves cancer from epithelial cells

To further identify malignant epithelial cells in LUAD, the somatic large-scale chromosomal CNV of each epithelial cluster was determined. Fibroblasts and endothelial cells were used as reference cells, with no CNV in LUAD (Figure 3C). The result showed that clusters 2 and 5 had significantly higher CNV scores than reference cells, with different chromosomal CNV patterns (Figure 3C,3D). In addition, EPCAM was significantly expressed in clusters 2 and 5, further confirming the malignant features of cancer cells (Figure 3E,3F). Therefore, clusters 2 and 5 were defined as the malignant epithelial cells and clusters 0, 1, 3, 4, and 6 as normal epithelial cells. According to their marker genes, the non-cancer epithelial cell clusters were further annotated into alveolar type 1, alveolar type 2, club, and ciliated cells (Figure 3G). Furthermore, the UMAP of normal and tumor tissue were grouped separately and revealed a distinct cell composition (Figure 3H).

Plasticity of epithelial lung cells and their developmental trajectory into malignant tumor cells with the identification of PTDGs

Previous studies have demonstrated that both alveolar epithelial type II (AT2) cells and club cells have the potential to develop into LUAD cells (24,25). Therefore, we organized AT2, club, and LUAD cancer cells based on their developmental trajectories and to identify the crucial gene expression programs regulating tumor progression (available online: https://cdn.amegroups.cn/static/public/jtd-2025-1581-4.xlsx, https://cdn.amegroups.cn/static/public/jtd-2025-1581-5.xlsx). The pseudo-time analysis revealed that AT2 and club cells could independently transit into LUAD tumors (Figure 4A). Using the differentialGeneTest function, 3,278 pseudo-time differentially expressed genes (PTDGs) involved in tumor progression were obtained by screening with a threshold q value <0.01. The heat map was drawn for the trends of 100 highly pseudo-temporal differential genes along the pseudo-timeline for AT2 cells (Figure 4B). In this study, 100 genes were grouped into four clusters with different expression dynamics. Genes in clusters 1 and 2 were highly expressed at the starting point and decreased over time. In contrast, genes in clusters 3,4 showed an increasing expression trend over time. The trends of six highly pseudo-time-dependent genes were also plotted, as illustrated in Figure 4C.

Figure 4 Tumor epithelium progression studied using pseudo-time analysis and gene-dependent screening. (A) Pseudotime analysis by Monocle2 shows the potential evolutionary trajectory of LUAD with AT2 as the root; and with club cells as the root. (B,C) The heatmap depicted the expression trends of 100 genes along the pseudo-timeline, which were classified into four clusters with distinct expression dynamics. Subsequently, the trends of six genes highly dependent on pseudo-time were plotted. (D) A coefficient profile plot was generated against the log(lambda) sequence to select the optimal parameter (lambda) in the LASSO model. It shows the LASSO coefficient profiles for the 335 candidate prognostic genes. (E) Venn plots displaying the 17 intersecting genes identified by both RF and LASSO screening algorithms. (F-H) A risk model was developed and assessed using the training, internal, and external validation datasets. AUC, area under the curve; DFS, disease-free survival; LASSO, least absolute shrinkage and selection operator; LUAD, lung adenocarcinoma; OS, overall survival; RF, random forest.

Identify prognosis-related genes

By overlapping the pseudo-time genes and TCGA transcriptome data of LUAD, 2,597 significantly DEGs were analyzed for subsequent analysis. The 2,597 DEGs were subject to univariate Cox regression analysis in the training set, and 335 DEGs with prognostic value were identified (P<0.05). Then, LASSO regression and RF algorithms were used, identifying 17 significant genes (Figure 4D,4E). The parameters of RF were applied (ntree =100, nsplit =1). Subsequently, a multivariate stepwise Cox risk regression analysis was conducted, and an optimal set of 13 prognostic DEGs was obtained.

Risk model construction and survival analysis

A risk prediction model was established, and risk score values were calculated. Patients from the TCGA training and validation sets were divided into high-risk and low-risk groups based on the median risk score. The high-risk group included 165 cases, while the low-risk group comprised 165 cases in the training set. In the internal validation set, there were 66 cases in the high-risk group and 74 in the low-risk group. The formula of the model is as follows: Riskscore = EXP[(2.2460) × RIPK2 + (2.4844) × FURIN + (1.7893) × DDIT4 + (1.6429) × TWIST2 + (−0.4098) × XCL2 + (−1.8997) × MAP3K8 + (−0.6379) × BEX5 + (1.8972) × GPRC5A + (−0.8776) × CRNDE + (−3.9409) × SLC43A2 + (1.3648) × PTTG1 + (−1.6548) × ST6GALNAC2 + (−1.1852) × IKZF3 + (−4.7796)]. The risk prediction model performed well in the survival prediction of patients in the training and validation sets. The area under the curve (AUC) for predicting the 5-year OS of patients was 0.767 and 0.664 for patients in the training and internal validation sets, respectively (Figure 4F). To confirm its generalizability, the PTDGs-based risk score model was externally validated, showing strong predictive accuracy for OS and disease-free survival (DFS) in GSE30219 (Figure 4G) and for OS in GSE37745 (Figure 4H). To further validate the risk model, it was compared with the risk calculation formula proposed by Li et al. (22) and Tang et al. (23), revealing favorable outcomes for this model (Figure 5A-5C). According to the Kaplan-Meier analysis, the low-risk group demonstrated a significantly higher survival rate than the high-risk group (Figure 5D-5G).

Figure 5 Prognostic-related genes were screened, and a risk model was developed and evaluated. (A-C) The performance of the risk model was subsequently compared with other published external risk calculation formulas. The ROC curves depict the predicted performance of the risk model at 1, 3, and 5 years. (D-G) Kaplan-Meier curves for OS and DFS were generated for the training, internal, and external validation sets. (H,I) Univariate and multivariate independent predictive analysis to analyze that the risk score and tumor stage were independently associated with OS. (J-L) 1-, 3-, and 5-year ROC curves for the risk score and other clinical features. The marker * shows a statistical difference compared to the risk score. AUC, area under the curve; CI, confidence interval; DFS, disease-free survival; OS, overall survival; ROC, receiver operating characteristic.

Independent prognostic analysis and construction of a nomogram

Collinearity between TNM stage and genes was assessed using VIF and Pearson correlation analyses. The results showed no evidence of multicollinearity, with VIF <5 and a Pearson correlation coefficient of 0.22 (Tables S1,S2). The univariate and multivariate Cox regression analysis results demonstrated that when the P threshold was set to <0.05, the tumor stage and risk score of LUAD could be used as two independent prognostic factors for patients with LUAD (Figure 5H-5I). Furthermore, ROC curves and the DeLong test were employed to assess the predictive accuracy of the risk score (Figure 5J-5L). The AUC values for the risk scores at 1-, 3-, and 5-year intervals surpassed those for age, gender, stage, smoking history, radiation and chemotherapy treatment. Notably, statistically significant differences were observed in the AUC values between the risk score and other clinical characteristics, demonstrating its superior predictive performance.

Simultaneously, the clinical application nomogram of the integrated model is illustrated in Figure 6A. The AUC values for predicting 1-, 3-, and 5-year OS were 0.794, 0.742, and 0.760, respectively (Figure 6B). Calibration curve analysis confirmed good agreement between the predicted and actual 1-, 3-, and 5-year OS rates (Figure 6C). The predictive efficacy of the three models was assessed and compared using the ROC curve and the DeLong test, revealing that the risk model exhibited superior predictive efficacy (Figure 6D-6F). These findings indicate that the risk model resulted in a more accurate and reliable prediction of the outcome of interest.

Figure 6 Construct a nomogram and evaluate PTDGs variation in LUAD tumor versus normal tissues. (A) Prognostic nomogram to predict the OS in LUAD. (B) ROC curves for 1-, 3- and 5-year survival in the TCGA cohort (AUC: 0.794, 0.742, 0.760). (C) Calibration curves for 1, 3, and 5 years. (D-F) 1-, 3-, and 5-year ROC curves for the combined and other models. (G-I) Transcriptional level of 13 PTDGs expression were shown in 510 lung adenocarcinoma tissues compared to 58 normal tissues in TCGA cohort (G), and also presented in LUAD primary tumors in comparison with adjacent normal tissues in GSE115002 (H) and GSE116959 (I). Student’s t-test was used to determine statistical significance. ****, P<0.0001; ***, P<0.001; **, P<0.01; *, P<0.05; ns, not significant. AUC, area under the curve; LUAD, lung adenocarcinoma; OS, overall survival; PTDG, pseudo-time differential gene; ROC, receiver operating characteristic; TCGA, The Cancer Genome Atlas.

Differential expression of PTDGs in LUAD patients in multiple cohorts and validation of the protein levels using the HPA database

We detected the mRNA expression of 13 PTDGs in LUAD samples and adjacent normal tissues using RNA-sequence data from TCGA and GEO independent cohorts. As shown in Figure 6G-6I, the mRNA expression levels of DDIT4, FURIN, IKZF3, PTTG1 and RIPK2 were found to be significantly higher in 510 LUAD tissues compared with 58 healthy tissues, while BEX5, GPRC5A, MAP3K8, SLC43A2, ST6GALNAC2, TWIST2, XCL2 highly expressed in adjacent tissues. And there was no significant difference in the levels of MAP3K8, RIPK2 and ST6GALNAC2 between LUAD tissues and adjacent tissues in GSE115002, as well as FURIN and IKZF3 in GSE116959, which is probably due to the limited number of samples. As shown in Figure 7, DDIT4, FURIN, PTTG1 and RIPK2 were greatly elevated in the LUAD tissues while GPRC5A was decreased in the tumor samples via IHC staining, which is consistent with multicenter data at the transcriptional level. These findings were further corroborated by validation using the CPTAC LUAD proteomic dataset, as shown in Figure 8.

Figure 7 The Human Protein Atlas was used to validate the protein expression levels of the identified pseudo-time differential genes. The representative IHC images of BEX5, DDIT4, FURIN, GPRC5A, IKZF3, MAP3K8, PTTG1, RIPK2, SLC43A2 in LUAD and adjacent normal tissues. Data were obtained from HPA at BEX5 (normal: https://www.proteinatlas.org/ENSG00000184515-BEX5/tissue/lung#imid_16160622, tumor: https://www.proteinatlas.org/ENSG00000184515-BEX5/cancer/lung+cancer#imid_16160465; antibody staining: HPA059058); DDIT4 (normal: https://www.proteinatlas.org/ENSG00000168209-DDIT4/tissue/lung#imid_8490787; tumor: https://www.proteinatlas.org/ENSG00000168209-DDIT4/cancer/lung+cancer#imid_8487032; antibody staining: HPA034508); FURIN (normal: https://www.proteinatlas.org/ENSG00000140564-FURIN/tissue/lung#imid_3452495; tumor: https://www.proteinatlas.org/ENSG00000140564-FURIN/cancer/lung+cancer#imid_3443789; antibody staining: CAB009499); GPRC5A (normal: https://www.proteinatlas.org/ENSG00000013588-GPRC5A/tissue/lung#imid_19236755; tumor: https://www.proteinatlas.org/ENSG00000013588-GPRC5A/cancer/lung+cancer#imid_19230144; antibody staining: HPA007928); IKZF3 (normal: https://www.proteinatlas.org/ENSG00000161405-IKZF3/tissue/lung#imid_6410399; tumor: https://www.proteinatlas.org/ENSG00000161405-IKZF3/cancer/lung+cancer#imid_6410180; antibody staining: HPA024377); MAP3K8 (normal: https://www.proteinatlas.org/ENSG00000107968-MAP3K8/tissue/lung#imid_4639695; tumor: https://www.proteinatlas.org/ENSG00000107968-MAP3K8/cancer/lung+cancer#imid_4638747; antibody staining: HPA017962); PTTG1 (normal: https://www.proteinatlas.org/ENSG00000164611-PTTG1/tissue/lung#imid_12396120; tumor: https://www.proteinatlas.org/ENSG00000164611-PTTG1/cancer/lung+cancer#imid_12390216; antibody staining: HPA045034); RIPK2 (normal: https://www.proteinatlas.org/ENSG00000104312-RIPK2/tissue/lung#imid_4387967; tumor: https://www.proteinatlas.org/ENSG00000104312-RIPK2/cancer/lung+cancer#imid_4399645; antibody staining: HPA015273); SLC43A2 (normal: https://www.proteinatlas.org/ENSG00000167703-SLC43A2/tissue/lung#imid_5903263; tumor: https://www.proteinatlas.org/ENSG00000167703-SLC43A2/cancer/lung+cancer#imid_5902821; antibody staining: HPA021564). Staining method: immunohistochemistry with DAB chromogen and hematoxylin counterstain; magnification: ×20. LUAD, lung adenocarcinoma.
Figure 8 Validation of a subset of pseudo-time differentially expressed genes using CPTAC LUAD proteomic data, with the median of Z-values labeled, and representative immunohistochemistry images of XCL2 in LUAD and adjacent normal tissues (obtained from HPA at normal: https://www.proteinatlas.org/ENSG00000143185-XCL2/tissue/lung#imid_16195334; tumor: https://www.proteinatlas.org/ENSG00000143185-XCL2/cancer/lung+cancer#imid_16193395; antibody staining: HPA057725; staining method: immunohistochemistry with DAB chromogen and hematoxylin counterstain; magnification: ×20). Z-values indicate standard deviations from the median across samples for the specified cancer type. Log2 spectral count ratio values from CPTAC were first normalized within each sample profile and subsequently normalized across all samples. Welch’s t-test was used to determine statistical significance. ****, P<0.0001; ***, P<0.001. LUAD, lung adenocarcinoma.

Discussion

This study examined pseudotime-related genes in LUAD in relation to patient prognoses. According to these results, the PTDGs-based risk model performed well in both the training, internal and external validation datasets. The high-risk group had significantly lower survival rates compared to the low-risk group. Moreover, a comparison between this risk model and previously published risk scores revealed that this model exhibited superior overall performance (Figure 5A-5C). Additionally, Jiang et al. employed single-cell RNA-seq technology to compare gene expression profiles between normal and malignant cells, allowing them to identify DEGs (26). Furthermore, this study employed pseudo-time series analysis to investigate pseudo-time differential expressed genes during tumor formation. This approach helped to identify key genes involved in the malignant cellular transformation. Tang et al. also observed a strong correlation between FABP5 expression and cell type differences using pseudo-time series analysis (27). Overall, pseudo-time series analysis offers a dynamic perspective that enables a comprehensive understanding of the multiple genes and molecular mechanisms involved in the initiation and progression of cancer.

The relationship between the most identified PTDGs and carcinogenesis or prognosis of NSCLC has been reported. DDIT4 is a gene signature associated with resistance to tyrosine kinase inhibitors. Patients identified as high-risk by this signature have demonstrated increased sensitivity to chemotherapy and have exhibited elevated expression levels of common immune checkpoints (28). And FURIN had abnormally high expression in LUAD, which was related to malignant progression (29,30). Further, Zhong et al. found that the expression of GPRC5A decreased in the cancerous tissue regions of NSCLC (31).

For the rigor and accuracy of the study, we detected the expression of 13 PTDGs in LUAD tissues and adjacent normal tissues using multicenter data, IHC staining provided by HPA database, and proteomic data from the CPTAC LUAD dataset. Consistent with the transcriptome expression in the TCGA database, DDIT4, FURIN, PTTG1 and RIPK2 were greatly elevated in the LUAD tissues while GPRC5A were decreased in the tumor samples. Additionally, BEX5, SLC43A2 and IKZF3 were not significantly detected in normal tissues and tumor tissues through immunohistochemistry, but showed significant expression variance in multicenter data. Single-cell sequencing and pseudotime analysis are conducted at the mRNA transcriptome level, whereas the HPA database primarily offers tissue staining data at the protein level. Owing to various biological processes, including post-transcriptional regulation, translational regulation, and protein degradation, mRNA abundance does not always correlate perfectly linearly with protein abundance (32). Moreover, the critical feature of PTDGs is their expression variation trends during cell state transitions, rather than their ultimate absolute expression levels (33). The HPA provides a static snapshot of protein expression at a particular time point, which may not fully capture these dynamic processes (34). Therefore, all these anomalous phenomena need further experimental verification.

Literature review indicates that KRAS is the most frequently mutated oncogene in NSCLC. A comprehensive analysis of 17,095 NSCLC samples from 2015–2020 revealed a KRAS mutation prevalence of 37.2% (3,889 cases) specifically in the LUAD (35). Another review corroborates this, noting KRAS as the most common NSCLC oncogene with an overall mutation rate of approximately 30%, ranging from 20–40% in LUAD (36). This underscores the clinical significance of KRAS-driven LUAD, making it a pertinent focus for investigation. In our study, we utilized GSE149655 as the discovery cohort, which offers a rare single-cell RNA-seq series atlas of early-stage KRAS-mutant LUAD tumors paired with matched normal lung tissues (37). This unique dataset facilitates precise pseudo-time trajectory inference to uncover dynamic PTDGs, providing insights into tumor progression that are challenging to obtain from bulk sequencing or less resolved cohorts. While GSE149655’s selection is rational for exploratory purposes in LUAD, we acknowledge its limitations, including the small sample size and retrospective nature, constrained by available computational resources and equipment. To bolster reliability, we implemented rigorous data quality controls and conducted extensive external validations in larger, diverse cohorts, thereby enhancing the robustness and generalizability of our findings.

Of course, there are some limitations in this study. First, the predictive model’s validation relies on retrospective data and should be confirmed further through prospective studies. Secondly, the sample size was restricted to TCGA and GEO data, requiring further multi-center validation in future research. Finally, this study did not explore the detailed mechanisms of the identified risk genes.


Conclusions

In summary, this study employed a single-cell pseudo-time series analysis to identify PTDGs associated with LUAD progression. Each patient in the TCGA-LUAD cohort received a risk score and was stratified into high-risk and low-risk groups according to the median score. Subsequently, a risk prediction model for LUAD was developed, which may provide a new method for diagnosing and treating this disease.


Acknowledgments

We greatly appreciate TCGA, GEO, UALCAN and HPA databases, and the advice of Dr. Zhengtian Lv on study design.


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1581/rc

Peer Review File: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1581/prf

Funding: This research was supported by Zhejiang Provincial Natural Science Foundation of China (Nos. LTGY24H180006, LTGY23H180001), Medical and Health Science and Technology Project of Zhejiang Province (Nos. 2024KY129, 2025KY093, 2024KY132, 2022KY230), Zhejiang Province Traditional Chinese Medicine Science and Technology Plan Project (Nos. 2025ZL302, 2025ZS012); and the Research Project of Zhejiang Chinese Medical University (No. 2022FSYYZY08). The study sponsors had no role in the study design, in the collection, analysis and interpretation of data; in the writing of the manuscript; and in the decision to submit the manuscript for publication.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1581/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Siegel RL, Kratzer TB, Giaquinto AN, et al. Cancer statistics, 2025. CA Cancer J Clin 2025;75:10-45. [Crossref] [PubMed]
  2. Herbst RS, Morgensztern D, Boshoff C. The biology and management of non-small cell lung cancer. Nature 2018;553:446-54. [Crossref] [PubMed]
  3. Chen Z, Zhao M, Li M, et al. Identification of differentially expressed genes in lung adenocarcinoma cells using single-cell RNA sequencing not detected using traditional RNA sequencing and microarray. Lab Invest 2020;100:1318-29. [Crossref] [PubMed]
  4. Yang D, Ma X, Song P. A prognostic model of non small cell lung cancer based on TCGA and ImmPort databases. Sci Rep 2022;12:437. [Crossref] [PubMed]
  5. Fan J, Slowikowski K, Zhang F. Single-cell transcriptomics in cancer: computational challenges and opportunities. Exp Mol Med 2020;52:1452-65. [Crossref] [PubMed]
  6. Li PH, Kong XY, He YZ, et al. Recent developments in application of single-cell RNA sequencing in the tumour immune microenvironment and cancer therapy. Mil Med Res 2022;9:52. [Crossref] [PubMed]
  7. Chung W, Eum HH, Lee HO, et al. Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat Commun 2017;8:15081. [Crossref] [PubMed]
  8. Darmanis S, Sloan SA, Croote D, et al. Single-Cell RNA-Seq Analysis of Infiltrating Neoplastic Cells at the Migrating Front of Human Glioblastoma. Cell Rep 2017;21:1399-410. [Crossref] [PubMed]
  9. Ma KY, Schonnesen AA, Brock A, et al. Single-cell RNA sequencing of lung adenocarcinoma reveals heterogeneity of immune response-related genes. JCI Insight 2019;4:e121387. [Crossref] [PubMed]
  10. Clarke J, Panwar B, Madrigal A, et al. Single-cell transcriptomic analysis of tissue-resident memory T cells in human lung cancer. J Exp Med 2019;216:2128-49. [Crossref] [PubMed]
  11. Yang Q, Zhang H, Wei T, et al. Single-Cell RNA Sequencing Reveals the Heterogeneity of Tumor-Associated Macrophage in Non-Small Cell Lung Cancer and Differences Between Sexes. Front Immunol 2021;12:756722. [Crossref] [PubMed]
  12. Chen J, Tan Y, Sun F, et al. Single-cell transcriptome and antigen-immunoglobin analysis reveals the diversity of B cells in non-small cell lung cancer. Genome Biol 2020;21:152. [Crossref] [PubMed]
  13. Patel AP, Tirosh I, Trombetta JJ, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 2014;344:1396-401. [Crossref] [PubMed]
  14. Liu Z, Lou H, Xie K, et al. Reconstructing cell cycle pseudo time-series via single-cell transcriptome data. Nat Commun 2017;8:22. [Crossref] [PubMed]
  15. Campbell KR, Yau C. Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data. Nat Commun 2018;9:2442. [Crossref] [PubMed]
  16. Gupta A, Bar-Joseph Z. Extracting dynamics from static cancer expression data. IEEE/ACM Trans Comput Biol Bioinform 2008;5:172-82. [Crossref] [PubMed]
  17. Li Y, Swift S, Tucker A. Modelling and analysing the dynamics of disease progression from cross-sectional studies. J Biomed Inform 2013;46:266-74. [Crossref] [PubMed]
  18. Haghverdi L, Lun ATL, Morgan MD, et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 2018;36:421-7. [Crossref] [PubMed]
  19. Subramanian A, Alperovich M, Yang Y, et al. Biology-inspired data-driven quality control for scientific discovery in single-cell transcriptomics. Genome Biol 2022;23:267. [Crossref] [PubMed]
  20. Tirosh I, Izar B, Prakadan SM, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 2016;352:189-96. [Crossref] [PubMed]
  21. Peng J, Sun BF, Chen CY, et al. Single-cell RNA-seq highlights intra-tumoral heterogeneity and malignant progression in pancreatic ductal adenocarcinoma. Cell Res 2019;29:725-38. [Crossref] [PubMed]
  22. Li F, Ge D, Sun SL. A novel ferroptosis-related genes model for prognosis prediction of lung adenocarcinoma. BMC Pulm Med 2021;21:229. [Crossref] [PubMed]
  23. Tang X, Qi C, Zhou H, et al. A novel metabolic-immune related signature predicts prognosis and immunotherapy response in lung adenocarcinoma. Heliyon 2022;8:e10164. [Crossref] [PubMed]
  24. Sarode P, Mansouri S, Karger A, et al. Epithelial cell plasticity defines heterogeneity in lung cancer. Cell Signal 2020;65:109463. [Crossref] [PubMed]
  25. Kim N, Kim HK, Lee K, et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nat Commun 2020;11:2285. [Crossref] [PubMed]
  26. Jiang A, Wang J, Liu N, et al. Integration of Single-Cell RNA Sequencing and Bulk RNA Sequencing Data to Establish and Validate a Prognostic Model for Patients With Lung Adenocarcinoma. Front Genet 2022;13:833797. [Crossref] [PubMed]
  27. Tang Q, Mao X, Chen Z, et al. Liquid-liquid phase separation-related gene in gliomas: FABP5 is a potential prognostic marker. J Gene Med 2023;25:e3517. [Crossref] [PubMed]
  28. Shi Y, Xu Y, Xu Z, et al. TKI resistant-based prognostic immune related gene signature in LUAD, in which FSCN1 contributes to tumor progression. Cancer Lett 2022;532:215583. [Crossref] [PubMed]
  29. Li C, Yuan Y, Jiang X, et al. Identification and validation of tumor microenvironment-related signature for predicting prognosis and immunotherapy response in patients with lung adenocarcinoma. Sci Rep 2023;13:13568. [Crossref] [PubMed]
  30. Shu L, Tang J, Liu S, et al. Plasma cell signatures predict prognosis and treatment efficacy for lung adenocarcinoma. Cell Oncol (Dordr) 2024;47:555-71. [Crossref] [PubMed]
  31. Zhong S, Yin H, Liao Y, et al. Lung Tumor Suppressor GPRC5A Binds EGFR and Restrains Its Effector Signaling. Cancer Res 2015;75:1801-14. [Crossref] [PubMed]
  32. Vogel C, Marcotte EM. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat Rev Genet 2012;13:227-32. [Crossref] [PubMed]
  33. Song D, Li JJ. PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data. Genome Biol 2021;22:124. [Crossref] [PubMed]
  34. Roy AL, Conroy RS. Toward mapping the human body at a cellular resolution. Mol Biol Cell 2018;29:1779-85. [Crossref] [PubMed]
  35. Judd J, Abdel Karim N, Khan H, et al. Characterization of KRAS Mutation Subtypes in Non-small Cell Lung Cancer. Mol Cancer Ther 2021;20:2577-84. [Crossref] [PubMed]
  36. Reita D, Pabst L, Pencreach E, et al. Direct Targeting KRAS Mutation in Non-Small Cell Lung Cancer: Focus on Resistance. Cancers (Basel) 2022;14:1321. [Crossref] [PubMed]
  37. Dost AFM, Moye AL, Vedaie M, et al. Organoids Model Transcriptional Hallmarks of Oncogenic KRAS Activation in Lung Epithelial Progenitor Cells. Cell Stem Cell 2020;27:663-678.e8. [Crossref] [PubMed]
Cite this article as: Jin H, Jin H, Wu T, Huang K, Lin J, Wu L, Gao C. Tumor prognostic risk stratification based on pseudo-time analysis of single-cell sequencing for patients with lung adenocarcinoma. J Thorac Dis 2025;17(12):10852-10868. doi: 10.21037/jtd-2025-1581

Download Citation