Transcriptome analysis and artificial intelligence for predicting lymph node metastasis of esophageal squamous cell carcinoma
Original Article

Zhengang Zhao1#, Yujie Xie2#, Dongmei Lai3, Jin Liang1, Ikenna C. Okereke4, Wanli Lin1

1Department of Thoracic Surgery, Gaozhou People’s Hospital Affiliated to Guangdong Medical University, Maoming, China; 2Lung Cancer Center, West China Hospital of Sichuan University, Chengdu, China; 3Department of Oncology, Gaozhou People’s Hospital Affiliated to Guangdong Medical University, Maoming, China; 4Department of Surgery, Henry Ford Health, Detroit, MI, USA

Contributions: (I) Conception and design: Z Zhao, Y Xie; (II) Administrative support: J Liang, W Lin; (III) Provision of study materials or patients: Z Zhao, Y Xie; (IV) Collection and assembly of data: D Lai, Y Xie; (V) Data analysis and interpretation: Z Zhao, W Lin; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Wanli Lin, MB. Department of Thoracic Surgery, Gaozhou People’s Hospital Affiliated to Guangdong Medical University, 89 Xiguan Road, Maoming 525200, China. Email: wanlilin2020@163.com.

Background: Lymph node metastasis (LNM) is the most common route of metastasis in esophageal squamous cell carcinoma (ESCC), and the treatment of patients with ESCC largely depends on the LNM status. The methods for diagnosing LNM in ESCC are still not accurate enough, and accurate LNM staging is crucial for clinical practice. The purpose of this study was to investigate the value of combining transcriptome analysis with artificial intelligence (AI) in predicting LNM and to construct an effective predictive model for LNM in ESCC.

Methods: We first enrolled 32 patients with ESCC for RNA sequencing (RNA-seq) to identify differentially expressed messenger RNAs (mRNAs), and then selected candidate genes via a random forest machine learning algorithm. Quantitative real-time polymerase chain reaction (qRT-PCR) was used to detect the expression of three candidate genes. For the assessment of the overall survival (OS) of patients with ESCC, we used the Kaplan-Meier method and the log-rank test. Univariate and multivariate logistic regression analyses were performed to screen for risk model factors. The model was validated with the area under the curve (AUC) and visualized through a nomogram. The AI model was built with a random forest algorithm. We included five variables in the AI model and divided the data from 209 patients into a training set and a validation set to evaluate the model’s performance. Thereafter, receiver operating characteristic (ROC) curves and the AUC were used to validate the AI system and to conduct subgroup analyses.

Results: RNA-seq identified 2,837 genes that were differentially expressed in ESCC tissues with LNM. We used a random forest machine learning algorithm to screen candidate diagnostic genes for patients with ESCC with LNM, and the three highest-ranked genes were SIM2, CUX1, and CYP4B1. Analysis of OS indicated that patients with LNM had a worse prognosis. Low expression of SIM2 was associated with worse OS, as was high expression of CUX1 or CYP4B1. Univariate and multivariate logistic regression analyses identified five independent factors, which were included in the risk model. In the ROC curve analysis, the AUC for the most effective logistic regression model was 0.83. A nomogram was used to display the predictive variables. Finally, these five variables were used to create the AI model, and the AUC was 0.78. Moreover, in the subgroup analysis, the AI model yielded an AUC of 0.78 for patients with clinical tumor stage T3.

Conclusions: AI and transcriptome analysis can be used to create a risk model for predicting LNM, and it can enhance prediction accuracy and inform clinical staging and decision-making before surgery.

Keywords: Transcriptome analysis; artificial intelligence (AI); esophageal squamous cell carcinoma (ESCC); lymph node metastasis (LNM)


Submitted Mar 29, 2025. Accepted for publication May 16, 2025. Published online May 28, 2025.

doi: 10.21037/jtd-2025-662


Highlight box

Key findings

• Using transcriptome analysis combined with artificial intelligence (AI), we constructed effective models predicting lymph node metastasis (LNM) in esophageal squamous cell carcinoma (ESCC).

What is known and what is new?

• AI has been used for predictive modeling in many recent studies, and a number of genetic biomarkers have been identified via transcriptome analysis.

• In this study, we first conducted RNA sequencing to screen for messenger RNAs associated with LNM in ESCC. A random forest machine learning algorithm was then used to rank candidate diagnostic genes, with which an effective predictive model for LNM in ESCC was constructed. The results indicated that the AI model, as well as logistic regression modeling, could effectively predict LNM in ESCC.

What is the implication, and what should change now?

• This predictive model could potentially serve as a powerful tool for developing better staging and treatment plans in clinical practice, especially for those who require neoadjuvant therapy. Improvements to the model through the use of larger sample sizes are still needed.


Introduction

Esophageal cancer is a common malignant tumor, ranking seventh in mortality and eleventh in incidence rate (1). There are two main pathological subtypes, esophageal squamous cell carcinoma (ESCC) and esophageal adenocarcinoma. ESCC is more prevalent in the eastern hemisphere, South America and central Asia (2). The prognosis of ESCC is poor, with a 10-year survival rate of approximately 20% for patients with locally advanced ESCC (3). Lymph node metastasis (LNM) is the most important prognostic factor for ESCC and affects the treatment paradigm (4), and the number of lymph nodes resected is significantly associated with patients’ overall survival (OS) (5). At present, the diagnosis of LNM in ESCC is mostly based on computed tomography (CT), positron emission tomography-computed tomography (PET-CT), or endoscopic ultrasonography. Unfortunately, these preoperative diagnostic tests still miss a relatively high proportion of LNM (6). Esophagectomy combined with systematic lymph node dissection is currently the primary treatment for ESCC (7). Preoperative LNM status is critically important for treatment selection in patients with ESCC. If LNM is diagnosed before surgery, neoadjuvant therapy is the preferred treatment option and results in improved OS compared to upfront surgery (2,8). For patients with early-stage ESCC without LNM, radical nodal dissection may be unnecessary as it is associated with a high rate of postoperative complications (9). Understanding the unique biology of LNM in patients with ESCC and predicting the possibility of further metastasis and spread can help to optimize the treatment benefit for patients (10).

Prediction models based on the clinical characteristics of patients with ESCC have been developed, including nomograms (11) and prediction models based on machine learning (12) or on diagnostic molecular markers for ESCC. These markers include cell-free DNA (14), microRNA (15), and long noncoding RNA (16). By combining long noncoding RNA and clinical features, a nomogram with good performance in predicting LNM was constructed (13). Long noncoding RNAs have emerged as novel biomarkers in ESCC, especially as predictive indicators for LNM (17). These are all important tools for diagnosing LNM in ESCC. However, there are few reports in the literature on prediction models that combine transcriptome analysis and machine learning. In this study, we used transcriptome RNA sequencing (RNA-seq) to identify the key genes associated with LNM. Then, we further applied machine learning to screen for the candidate genes, and finally constructed a risk prediction model for LNM based on the random forest algorithm in combination with transcriptome analysis. Previous research indicates that regression analysis combined with a nomogram can be used to construct a predictive model for LNM with significant diagnostic value (18). We developed an LNM regression model incorporating five variables and constructed a nomogram, which yielded an area under the curve (AUC) value of 0.83. Artificial intelligence (AI) has undergone rapid advancements across various aspects of cancer research, particularly in the detection, classification, and prognostic analysis of cancer (19). In this vein, we further constructed a predictive model based on a random forest machine learning algorithm, which had a high AUC value in the subgroup analysis. We present this article in accordance with the TRIPOD reporting checklist (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-662/rc).


Methods

Study population and clinical samples

We included data from all patients with ESCC (cT1–3N0M0) who underwent esophagectomy with systematic lymph node dissection at Gaozhou People’s Hospital Affiliated to Guangdong Medical University between December 2019 and December 2024. Patients who met the following inclusion criteria were considered eligible: (I) completion of endoscopy before surgery, with ESCC confirmed by biopsy; (II) no history of concurrent or previous malignant tumors; and (III) standardized systematic lymph node dissection (more than 15 lymph nodes dissected). The exclusion criteria were as follows: (I) any type of neoadjuvant therapy (including neoadjuvant chemoradiotherapy, immunotherapy, hormone therapy, or other systemic therapy); (II) T categorization as in situ or unknown; and (III) missing data. This study ultimately included 209 tissue samples from patients with ESCC that were subjected to quantitative real-time polymerase chain reaction (qRT-PCR), with 32 of these samples also being subjected to RNA-seq. All tissues were stored in liquid nitrogen immediately after collection. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The procedures involving human participants were approved by the Medical Ethics Committee of the Gaozhou People’s Hospital Affiliated to Guangdong Medical University (No. GYLLYJ-2023216). Written informed consent was provided by all patients. All patients who met the criteria were followed up after surgery according to standard practice until the last visit.

Selection of clinical parameters

The clinical data of the enrolled patients were collected from the Gaozhou People’s Hospital Affiliated to Guangdong Medical University. All ESCCs were staged according to the eighth edition of the American Joint Committee on Cancer (AJCC) staging system. We collected the following information as the clinical parameters used in the analyses: gender, age, differentiation grade, tumor stage, number of lymph nodes dissected, and smoking history. We used the χ2 test to compare categorical variables and a two-tailed unpaired Student t-test to compare normally distributed continuous variables. Quantitative data were presented as the mean ± standard deviation (SD).

RNA extraction and qRT-PCR

We used the Cell/Tissue Total RNA Kit (cat. no. 19221ES50; Yeasen, Shanghai, China) to extract total RNA from ESCC tissue. We applied Hifair III 1st Strand cDNA Synthesis Kit (gDNA digester plus) (cat. no. 11141ES60; Yeasen) for reverse transcription. We used Hieff UNICON Universal Blue qPCR SYBR Master Mix to conduct qRT-PCR (cat. no. 11184ES08; Yeasen). To control the variability of candidate gene expression levels, the reference gene glyceraldehyde 3-phosphate dehydrogenase (GAPDH) was used as the control to standardize the expression data. All the experiments were independently repeated three times. The primers used are listed in Table S1.
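The text above names GAPDH as the reference gene but does not state the normalization formula. As a minimal sketch, assuming the widely used 2^(−ΔCt) approach (an assumption, not confirmed by the source), relative expression for one sample could be computed as follows. The Ct values are hypothetical and are not study data.

```python
from statistics import mean


def relative_expression(ct_target, ct_reference):
    """Return 2^-(dCt), where dCt = mean target Ct - mean reference Ct.

    Both arguments are lists of replicate Ct values for one sample.
    Assumes the 2^-dCt normalization; the article does not state its formula.
    """
    delta_ct = mean(ct_target) - mean(ct_reference)
    return 2.0 ** (-delta_ct)


# Hypothetical triplicate Ct values (illustrative only, not from the study).
sim2_ct = [27.1, 26.9, 27.0]
gapdh_ct = [18.0, 18.1, 17.9]
print(relative_expression(sim2_ct, gapdh_ct))
```

A lower target Ct relative to GAPDH gives a larger value, so higher numbers correspond to higher expression.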

RNA-seq

First, we extracted total RNA from the ESCC tissue of 32 patients (16 with LNM and 16 without LNM) and subjected it to RNA-seq to identify differentially expressed messenger RNAs (mRNAs). We performed library preparation and RNA-seq on an Illumina platform (Illumina, San Diego, CA, USA). Differential expression analysis between the two groups was performed with the “DESeq2” package in R (The R Foundation for Statistical Computing). The cutoff for differentially expressed RNAs was set as |log2 fold change| >1 and an adjusted P value <0.05. To display the results of differential expression, we used the “pheatmap” R package to generate heatmaps and the “ggVolcano” R package to create volcano plots.
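The cutoff described above can be illustrated with a short sketch. It only shows how the two thresholds combine and how genes split into upregulated and downregulated sets. The gene names and values are hypothetical placeholders, and the code is not the DESeq2 workflow itself.

```python
from typing import NamedTuple


class DEResult(NamedTuple):
    gene: str
    log2_fold_change: float
    padj: float  # adjusted P value


def is_differentially_expressed(result, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Apply |log2 fold change| > 1 and adjusted P < 0.05 (cutoffs from the text)."""
    return abs(result.log2_fold_change) > lfc_cutoff and result.padj < padj_cutoff


def split_by_direction(results):
    """Return (upregulated, downregulated) gene names among significant results."""
    significant = [r for r in results if is_differentially_expressed(r)]
    up = [r.gene for r in significant if r.log2_fold_change > 0]
    down = [r.gene for r in significant if r.log2_fold_change < 0]
    return up, down


# Hypothetical rows (placeholders, not study results).
toy_results = [
    DEResult("GENE_A", 2.3, 0.001),   # up, significant
    DEResult("GENE_B", -1.8, 0.02),   # down, significant
    DEResult("GENE_C", 0.6, 0.0001),  # fold change too small
    DEResult("GENE_D", 3.1, 0.20),    # adjusted P too large
]
up_genes, down_genes = split_by_direction(toy_results)
print(up_genes, down_genes)
```

Taking the absolute value of the fold change is what lets a single cutoff capture both directions of change.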

Selection of candidate genes via a random forest machine learning algorithm

Based on the correlation between the error rate and the total number of trees in a random forest error rate plot, the random forest model applied 300 decision trees. To further screen the differentially expressed candidate genes, the “randomForest” R package was used to apply the random forest machine learning algorithm to the ESCC expression profiles and LNM status, and a relative importance score was calculated for each gene. Genes were ranked according to their relative importance, and the top 20 most important genes were displayed. Among these 20 genes, the relative importance scores of SIM2 (0.392), CUX1 (0.376), and CYP4B1 (0.249) were much higher than those of the other genes. We therefore selected these top three genes from the random forest model as the candidate genes for predicting LNM status in ESCC.

OS analysis

For the assessment of the OS of patients with ESCC, Kaplan-Meier curves were used to estimate the survival probability over time. Differences between groups were compared using the log-rank test. The “survminer” and “survival” R packages were used to plot the Kaplan-Meier curves and perform the survival analysis. Regarding survival analysis grouping, we formed groups according to LNM status of ESCC or according to the median expression cutoff values of the qRT-PCR analysis.
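The Kaplan-Meier estimate and the median-based grouping described above can be sketched in a few lines. This is an illustration of the product-limit calculation only, not the “survival” package implementation. The follow-up times are hypothetical, the log-rank test is omitted, and assigning values equal to the median to the low group is an assumption.

```python
from statistics import median


def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times: follow-up time per patient; events: 1 = death, 0 = censored.
    Returns [(time, survival_probability)] at each time with at least one death.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


def split_by_median(expression):
    """Label each value 'high' (above the median) or 'low' (at or below it)."""
    cutoff = median(expression)
    return ["high" if value > cutoff else "low" for value in expression]


# Hypothetical follow-up data in months (not study data).
toy_times = [5, 8, 12, 12, 20]
toy_events = [1, 0, 1, 1, 0]
print(kaplan_meier(toy_times, toy_events))
```

Censored patients lower the number at risk without lowering the survival estimate, which is why the curve drops only at times when a death occurs.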

Univariate and multivariate logistic regression analyses

Univariate and multivariate logistic regression analyses were conducted to assess the significance of the variables associated with LNM. We first performed univariate logistic regression analyses, and included the variables with P<0.05 in the multivariate logistic regression analysis. We incorporated the following five variables into the final risk model: differentiation grade, tumor stage, CUX1 expression, CYP4B1 expression, and SIM2 expression. We also created a nomogram based on the above five variables from the final risk model. All the logistic regression analyses were performed with the “rms” R package. We finally plotted the receiver operating characteristic (ROC) curves to evaluate the predictive value by using the “pROC” R package, and the AUC was used to determine the models’ ability to predict LNM.

AI model building with random forest

We divided the data from 209 patients into a training set (70%) and a validation set (30%) to evaluate the model’s performance. We first used a random forest error rate plot to identify the optimal number of decision trees, and the final random forest model applied 220 decision trees. The “caret” R package was used to perform cross-validation to assess the robustness of the machine learning model. After feature selection, our random forest model ultimately incorporated the five key variables with the highest importance scores (differentiation grade, tumor stage, CUX1 expression, CYP4B1 expression, and SIM2 expression). The “randomForest” R package was used to apply the random forest machine learning algorithm to these five variables and LNM status.
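As a rough illustration of the 70/30 split described above, the sketch below partitions 209 patient indices at random. The seed and the use of a simple, non-stratified shuffle are assumptions, because the text does not say how the split was drawn.

```python
import random


def train_validation_split(n_patients, validation_fraction=0.3, seed=0):
    """Randomly split patient indices into training and validation sets.

    The seed and the non-stratified shuffle are illustrative assumptions.
    """
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    n_validation = round(n_patients * validation_fraction)
    validation = sorted(indices[:n_validation])
    train = sorted(indices[n_validation:])
    return train, validation


train_idx, validation_idx = train_validation_split(209)
print(len(train_idx), len(validation_idx))  # 146 63
```

With 209 patients, a 30% validation share rounds to 63, which matches the validation set size reported in the Results.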

Validation of the AI system and subgroup analysis

We plotted ROC curves with the “pROC” R package to evaluate the prediction performance of the AI system. Each ROC curve graphs sensitivity on the y-axis against specificity on the x-axis, and the AUC was then calculated to quantify the model’s ability to classify LNM status. For the subgroup analysis, we divided all patients into three groups based on the primary tumor stage (T1, T2, and T3) and plotted a ROC curve with its AUC for each group.
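To make the ROC and AUC quantities concrete, the following sketch computes ROC points as (specificity, sensitivity) pairs and the AUC as the probability that a randomly chosen LNM-positive patient receives a higher score than a randomly chosen LNM-negative patient. It is a plain illustration, not the “pROC” implementation, and the labels and scores are hypothetical.

```python
def auc_from_scores(labels, scores):
    """AUC as P(positive score > negative score) + 0.5 * P(tie)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def roc_points(labels, scores):
    """Return (specificity, sensitivity) pairs, predicting positive when score >= threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(1.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        points.append((1.0 - fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels (1 = LNM) and model scores, not study data.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(auc_from_scores(toy_labels, toy_scores))
print(roc_points(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported values should be read.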

Statistical analysis

To evaluate the sensitivity and specificity of the model predictions, we constructed ROC curves with AUC as the evaluation metric. OS was defined as the time from the first diagnosis to death from any cause. Patients who were still alive at the last follow-up were treated as censored. We used Kaplan-Meier curves and the log-rank test for the survival analysis. A P value less than 0.05 was considered statistically significant. All the statistical analyses were performed with R version 4.4.1 (The R Project for Statistical Computing, Vienna, Austria, https://www.r-project.org).


Results

Identification of differentially expressed genes between ESCC patients with and without LNM

To identify the genes related to LNM in ESCC, we performed RNA-seq on 32 patients with ESCC who underwent surgery. Among these patients, 16 had pathologic lymph node involvement, and another 16 patients did not. Of the total 18,782 annotated genes, 2,837 genes were differentially expressed (|log2 fold change| >1, adjusted P value <0.05; DESeq2 analysis method). A hierarchical clustering heatmap of the significantly differential genes is shown in Figure 1A. A volcano plot of the 2,837 significantly differential genes (1,762 upregulated and 1,075 downregulated), with the top 10 genes highlighted, is displayed in Figure 1B. We conducted Gene Ontology (GO) analysis on the significantly differential genes, covering biological process (BP), cellular component (CC), and molecular function (MF). GO analysis revealed that these genes were primarily enriched in terms such as collagen-containing extracellular matrix (ECM) and growth factor binding (Figure 1C). A circular clustering diagram of the GO enrichment analysis is shown in Figure 1D. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment showed that these genes were primarily involved in the ECM-receptor interaction signaling pathway, cytokine-cytokine receptor interaction signaling pathway, and protein digestion and absorption signaling pathway (Figure 1E). A circular clustering diagram of the KEGG pathway enrichment analysis is shown in Figure 1F.

Figure 1 Gene expression signature for LNM in ESCC identified by transcriptome analysis. (A) Heatmap of the expression profiles for 16 ESCC tumor samples with LNM (LIN+) compared with 16 ESCC tumor samples without LNM (LIN−). (B) Volcano plots of all differentially expressed genes between ESCC tumor samples with LNM (LIN+) as compared with ESCC tumor samples without LNM (LIN−) according to RNA-seq analysis. (C) Functional annotation in GO analysis for genes of matched LIN+/LIN− samples. (D) A circular clustering diagram of the GO enrichment analysis. (E) KEGG pathway enrichment for genes of matched LIN+/LIN− samples. (F) A circular clustering diagram of the KEGG pathway enrichment analysis. BP, biological process; CC, cellular component; ECM, extracellular matrix; ESCC, esophageal squamous cell carcinoma; FDR, false discovery rate; KEGG, Kyoto Encyclopedia of Genes and Genomes; LIN−, patients without lymph node metastasis; LIN+, patients with lymph node metastasis; LNM, lymph node metastasis; MF, molecular function; RNA-seq, RNA sequencing.

Candidate genes identified via a random forest model and qRT-PCR

To further screen the candidate diagnostic genes for patients with ESCC with LNM, we used a random forest machine learning algorithm. Based on the correlation between the error rate and the total number of trees, the random forest model applied 300 decision trees (Figure 2A). The random forest model ranked the top 20 genes by relative importance score, and the top three genes were SIM2, CUX1, and CYP4B1 (Figure 2B). We visualized these top three genes in a heatmap based on RNA-seq analysis, which showed that SIM2 expression was lower in the majority of ESCC tissues with LNM, whereas CYP4B1 and CUX1 expression was higher in most ESCC tissues with LNM compared with ESCC tissues without LNM (Figure 2C). To verify the RNA-seq analysis, we further conducted qRT-PCR analysis of 209 ESCC tissues. The results showed that SIM2 expression was significantly lower in ESCC tissue with LNM, while CYP4B1 and CUX1 expression was higher in ESCC tissue with LNM, which was consistent with the RNA-seq analysis (Figure 2D).

Figure 2 Candidate biomarkers identified by random forest and verified by qRT-PCR. (A) The correlation between the error rate and the total number of trees in the RNA-seq analysis according to the random forest algorithm. (B) The top 20 genes ranked by relative importance scores via the random forest algorithm. (C) Heatmap of the expression profiles for three candidate genes in 16 ESCC tumor samples with LNM (LIN+) as compared with 16 ESCC tumor samples without LNM (LIN−). (D) qRT-PCR analysis of the expression of three candidate genes (SIM2, CUX1, and CYP4B1) in 91 LIN+ patients compared with 118 LIN− patients. ***, P<0.001. ESCC, esophageal squamous cell carcinoma; LIN+, patients with lymph node metastasis; LIN−, patients without lymph node metastasis; N, normal tissues; qRT-PCR, quantitative real-time polymerase chain reaction; RNA-seq, RNA sequencing; T, tumor tissues.

LNM and the three candidate genes were associated with the prognosis of patients with ESCC

The Kaplan-Meier survival curve was used to analyze the OS data. First, we analyzed the correlation of LNM status with the prognosis of patients with ESCC, which indicated that patients with LNM had a worse prognosis (Figure 3A). After qRT-PCR analysis of the three candidate genes in 209 patients with ESCC, patients were grouped by the median expression value of each gene. Low expression of SIM2 was associated with worse OS (Figure 3B), as was high expression of CUX1 (Figure 3C) or CYP4B1 (Figure 3D). Collectively, these findings suggest that LNM status and the three candidate genes are closely correlated with the prognosis of patients with ESCC.

Figure 3 The survival curve analysis of lymph node metastasis status and the expression of the three candidate genes. (A) Survival analysis of patients with lymph node metastasis versus patients without lymph node metastasis. (B) Survival analysis of the high- vs. low-expression groups for SIM2 in ESCC. (C) Survival analysis of the high- vs. low-expression groups for CUX1 in ESCC. (D) Survival analysis of the high- vs. low-expression groups for CYP4B1 in ESCC. DFS, disease-free survival; ESCC, esophageal squamous cell carcinoma; LIN−, patients without lymph node metastasis; LIN+, patients with lymph node metastasis; OS, overall survival.

Baseline clinical characteristics of all patients and construction of a logistic regression model

After the application of selection criteria, 209 patients with ESCC were included in the analysis, among whom 91 (44%) had LNM and 118 did not. We used χ2 tests and found that patients with LNM were more likely to have a poor tumor differentiation grade (P<0.001). Moreover, LNM was significantly associated with higher primary tumor staging (T3 + T4 vs. T1 + T2; P=0.003), but not with age, gender, or smoking history. To control for the influence of the number of lymph nodes dissected, we conducted a t test, which showed that the mean number of lymph nodes dissected was 27 and that there was no difference between the two groups of patients in this regard (Table 1). To develop a prediction model, we conducted univariate and multivariate logistic regression analyses. We first included five clinical characteristics and three candidate genes in the univariate analysis. This indicated that differentiation grade, primary tumor invasion depth, and the expression of the three candidate genes were significantly associated with LNM. Finally, the multivariate analysis revealed that differentiation grade [odds ratio (OR): 2.900; 95% confidence interval (CI): 1.433–5.871], tumor stage (OR: 2.051; 95% CI: 1.040–4.043), CUX1 expression (OR: 2.215; 95% CI: 1.287–3.813), CYP4B1 expression (OR: 2.237; 95% CI: 1.224–4.090), and SIM2 expression (OR: 0.428; 95% CI: 0.261–0.701) were independent predictors of LNM. These five variables were included in the final model for predicting LNM in ESCC (Table 2).

Table 1

Clinical parameters of ESCC patients with and without pathologic lymph node metastasis

Clinical parameter | LIN− (n=118) | LIN+ (n=91) | P value
Gender | | | 0.68
   Male | 63 | 46 |
   Female | 55 | 45 |
Age (years) | 68.40±9.78 | 69.10±10.68 | 0.62
Differentiation grade | | | <0.001
   Poor | 28 | 46 |
   Well/moderate | 90 | 45 |
Lymph nodes dissected (n) | 27.20±8.15 | 27.18±8.25 | 0.99
Primary tumor invasion depth | | | 0.003
   T1/T2 | 66 | 32 |
   T3/T4 | 52 | 59 |
Smoker | | | 0.81
   No | 46 | 34 |
   Yes | 72 | 57 |

Data are presented as mean ± standard deviation or n. ESCC, esophageal squamous cell carcinoma; LIN−, lymph node pathology negative; LIN+, lymph node pathology positive; LNM, lymph node metastasis.
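The categorical comparisons in Table 1 can be checked by hand. The sketch below computes a Pearson χ2 statistic (without continuity correction, which is an assumption because the text does not say whether a correction was applied) and a crude odds ratio from a 2×2 table, using the differentiation grade and tumor invasion depth counts from Table 1.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]] with 1 degree of freedom.

    Returns (statistic, p_value). No continuity correction is applied (assumption).
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


def odds_ratio(exposed_pos, exposed_neg, unexposed_pos, unexposed_neg):
    """Crude odds ratio of LNM for the exposed versus the unexposed group."""
    return (exposed_pos / exposed_neg) / (unexposed_pos / unexposed_neg)


# Differentiation grade (Table 1): poor 28 LIN-, 46 LIN+; well/moderate 90 LIN-, 45 LIN+.
grade_stat, grade_p = chi_square_2x2(28, 46, 90, 45)
grade_or = odds_ratio(46, 28, 45, 90)

# Tumor invasion depth (Table 1): T1/T2 66 LIN-, 32 LIN+; T3/T4 52 LIN-, 59 LIN+.
depth_stat, depth_p = chi_square_2x2(66, 32, 52, 59)
depth_or = odds_ratio(59, 52, 32, 66)

print(round(grade_or, 3), round(depth_or, 2))  # 3.286 2.34
```

The crude odds ratios obtained this way agree with the univariate ORs for differentiation grade and T stage in Table 2, and both P values fall in the ranges reported in Table 1.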

Table 2

Univariate and multivariate logistic regression analysis of factors for predicting lymph node metastasis

Clinical parameter | Univariate OR | Univariate 95% CI | Univariate P | Multivariate OR | Multivariate 95% CI | Multivariate P
Age | 1.007 | 0.980–1.034 | 0.62 | | |
Smoking | 1.071 | 0.610–1.881 | 0.81 | | |
Gender | 0.876 | 0.506–1.517 | 0.64 | | |
Differentiation grade | 3.286 | 1.820–5.931 | <0.001 | 2.900 | 1.433–5.871 | 0.003
T stage | 2.34 | 1.332–4.110 | 0.003 | 2.051 | 1.040–4.043 | 0.04
CUX1 expression | 2.865 | 1.780–4.612 | <0.001 | 2.215 | 1.287–3.813 | 0.004
CYP4B1 expression | 4.096 | 2.279–7.362 | <0.001 | 2.237 | 1.224–4.090 | 0.009
SIM2 expression | 0.325 | 0.207–0.510 | <0.001 | 0.428 | 0.261–0.701 | 0.001

CI, confidence interval; OR, odds ratio; T stage, tumor stage.
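The multivariate odds ratios in Table 2 map onto a logistic model through their natural logarithms. The sketch below shows that relationship and how a predicted probability would be obtained. The intercept is not reported in the article, so the value used here is purely hypothetical, as are the example patient and the 0/1 coding of the gene expression variables.

```python
import math

# Multivariate odds ratios as reported in Table 2.
MULTIVARIATE_OR = {
    "poor_differentiation": 2.900,
    "t_stage_t3_t4": 2.051,
    "cux1_expression": 2.215,
    "cyp4b1_expression": 2.237,
    "sim2_expression": 0.428,
}

# Logistic regression coefficients are the natural logs of the odds ratios.
COEFFICIENTS = {name: math.log(value) for name, value in MULTIVARIATE_OR.items()}


def predicted_probability(features, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    linear_predictor = intercept + sum(
        COEFFICIENTS[name] * features[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# HYPOTHETICAL intercept and patient: neither is reported in the article.
HYPOTHETICAL_INTERCEPT = -1.0
example_patient = {
    "poor_differentiation": 1,
    "t_stage_t3_t4": 1,
    "cux1_expression": 1,
    "cyp4b1_expression": 0,
    "sim2_expression": 0,
}
print(predicted_probability(example_patient, HYPOTHETICAL_INTERCEPT))
```

An OR below 1, as for SIM2, yields a negative coefficient, so higher SIM2 expression lowers the predicted probability of LNM, while the other four variables raise it.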

Construction of the ROC curve and nomogram for the validation of the logistic regression model

In constructing the predictive model, we first created a nomogram consisting of five predictive variables and the corresponding point scales (Figure 4A). We then verified the ability of these five variables to predict LNM via the ROC curve. The overall AUC of the logistic regression model for patients with ESCC was 0.83 (Figure 4B). We also compared this logistic regression model with another logistic regression model that only included clinical tumor grade and stage. We found that the AUC of the logistic regression model composed of these two clinical characteristics was 0.69 (Figure 4C). Finally, we developed a logistic regression model composed only of the three candidate genes (CUX1, CYP4B1, and SIM2 expression), which yielded an AUC of 0.78 (Figure 4D). Overall, the results indicated that the predictive model, including five predictive variables, was the most effective for predicting LNM in patients with ESCC.

Figure 4 Nomogram and ROC curves for models’ ability to predict LNM in patients with ESCC. (A) Nomogram for the logistic regression model incorporating three candidate genes and two clinical parameters in predicting the likelihood of pathologic-positive LNM in patients with ESCC. (B) ROC analysis of the logistic regression model incorporating three candidate genes and two clinical parameters. (C) ROC analysis of the logistic regression model incorporating two clinical parameters: tumor stage and tumor grade. (D) ROC analysis of the logistic regression model incorporating three candidate genes (SIM2, CUX1, and CYP4B1). AUC, area under the curve; ESCC, esophageal squamous cell carcinoma; LNM, lymph node metastasis; ROC, receiver operating characteristic; T, tumor.

Validation of the AI model and subgroup analysis

To verify the performance of the AI model in predicting LNM in patients with ESCC, we used a random forest machine learning algorithm with five variables: differentiation grade, tumor stage, CUX1 expression, CYP4B1 expression, and SIM2 expression. We first assessed this model with the whole validation dataset (n=63), which yielded an AUC of 0.78 (Figure 5A). We also assessed the AI model according to clinical tumor stage. When the primary tumor stage was T1, T2, and T3, the AUC values were 0.67 (Figure 5B), 0.76 (Figure 5C), and 0.78 (Figure 5D), respectively. Overall, our AI model was an effective tool for predicting LNM in ESCC, and among the subgroups, its AUC was highest when the clinical tumor stage was T3.

Figure 5 ROC curves for the diagnostic ability of machine learning in patients with ESCC and LNM. (A) ROC analysis of random forest for predicting LNM in patients with all-stage ESCC. (B) ROC analysis of random forest for predicting LNM in patients with ESCC with clinical tumor stage T1. (C) ROC analysis of random forest for predicting LNM in patients with ESCC with clinical tumor stage T2. (D) ROC analysis of random forest for predicting LNM in patients with ESCC with clinical tumor stage T3. AUC, area under the curve; ESCC, esophageal squamous cell carcinoma; LNM, lymph node metastasis; ROC, receiver operating characteristic; T, tumor.

Discussion

ESCC often presents with LNM in the upper mediastinal and cervical regions, with an incidence of LNM ranging from 15% to 34%. LNM status is a critical prognostic indicator among patients with ESCC (2). However, the accuracy of commonly used diagnostic methods remains insufficient, and the incidence of occult LNM remains particularly high. Recently, diagnostic models based on multiple factors have been developed to predict LNM (11). Meanwhile, transcriptome sequencing analysis can identify key genes closely associated with LNM in ESCC. In this study, we performed RNA-seq on 32 patients with ESCC. This analysis generated 2,837 differentially expressed genes, with CST1 being among the top 10 most significantly differentially expressed genes. Previous work has shown that CST1 is significantly upregulated in ESCC tissues and promotes the growth and metastasis of ESCC (20). Additionally, MAGEA4 plays a crucial role in the early stages of ESCC, and its overexpression has the potential to serve as a prognostic marker for early-stage patients with ESCC (21). KEGG pathway enrichment showed that these genes were primarily involved in the ECM-receptor interaction signaling pathway, which is closely related to tumor metastasis (22). Using a random forest algorithm, we identified three potential genes closely related to LNM in ESCC. One of these genes, SIM2, which showed decreased expression in ESCC tissues with LNM, belongs to a family of transcriptional repressors and may control brain development and neuronal differentiation (23). SIM2 expression has been reported to be significantly lower in patients with ESCC (24). In contrast, SIM2 expression is reported to be elevated in prostate cancer (25). We identified another gene, CUX1, which was significantly upregulated in ESCC tissues with LNM and may serve as a biomarker for LNM. This result is in line with another study, which found that CUX1 protein levels were upregulated in ESCC tissues and were closely associated with tumor-node-metastasis (TNM) staging, LNM, and the prognosis of patients with ESCC. These data suggest that the CUX1 gene promotes the progression of ESCC and may serve as a biomarker for diagnosis and treatment (26). CYP4B1 was also significantly upregulated in ESCC tissues with LNM, suggesting its potential as a predictive biomarker for LNM. To our knowledge, this is the first study to report the relationship between CYP4B1 and ESCC. Recent research in lung cancer indicates that CYP4B1, a gene associated with LNM, is linked to poor prognosis in patients with lung adenocarcinoma. A prognostic model based on CYP4B1 may predict the prognosis of patients with lung adenocarcinoma and is associated with immune infiltration (27). Additionally, a study has shown that a high expression of CYP4B1 increases the risk of bladder cancer and may serve as an important tool for predicting bladder cancer risk (28). However, research on the correlation between CYP4B1 and LNM in ESCC is otherwise lacking.

The random forest algorithm can be used to screen and identify variables closely related to LNM in ESCC and rank them by importance (12). We used a random forest model to identify 20 important genes according to their relative importance scores, and the top three genes were SIM2, CUX1, and CYP4B1. These candidate genes were also found to be closely associated with the prognosis of patients with ESCC. Regression analysis can be combined with nomograms to construct a predictive model for LNM, which holds significant diagnostic value (18). We developed an LNM regression model incorporating five variables (differentiation grade, tumor stage, CUX1 expression, CYP4B1 expression, and SIM2 expression) and constructed a nomogram, which yielded an AUC value of 0.83. AI has been widely applied to a broad range of medical fields, including medical diagnosis and medical statistics, and AI may have substantial medical utility and economic value (29). In one study, gene expression data and random forest model analysis were used to construct a prognostic model for gastric cancer, which demonstrated high predictive value for the survival of patients with cancer and their sensitivity to immunotherapy (30). Research also suggests that AI systems, including those based on random forest algorithms, are highly effective in sensitively detecting tumor biomarkers for ESCC (31). These systems overcome the limitations associated with detecting low-abundance circulating proteins and can significantly enhance the diagnosis of ESCC. AI-based predictive models for LNM in ESCC have shown good predictive performance. The DL-radiomics-clinical (DRC) model, for example, demonstrated the highest accuracy compared with other predictive models (32). Screening for variables associated with LNM in ESCC is a crucial step in constructing predictive models. 
Currently, the screening of clinical features is predominantly based on univariate and multivariate logistic regression analyses, with variables showing statistical significance in the multivariate logistic analysis serving as candidate variables for model construction (33,34). In this study, while screening clinical features through univariate and multivariate logistic regression analyses, we also integrated AI and machine learning to identify important candidate genes related to ESCC. The final predictive model incorporated five significant variables based on the results from both approaches and performed better than models based solely on clinical features or on candidate genes. AI algorithms, especially random forest algorithms, can improve the accuracy of prediction compared with traditional prediction models. We constructed a predictive model based on a random forest machine learning algorithm, which showed good discrimination, with an AUC value of 0.78. We also performed subgroup analysis, in which the clinical tumor stage T3 subgroup yielded a higher AUC. In another study that used subgroup analysis of AI for predicting the risk of LNM in colorectal cancer, the AUC for the colon cohort was higher than that for the rectum cohort (35).
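As an illustration of how an AUC and its subgroup breakdown are obtained from predicted probabilities, the following minimal sketch computes the AUC as the proportion of LNM-positive/LNM-negative patient pairs that the model ranks correctly, and then repeats the calculation within each clinical T stage. It is not the code used in this study: the function names (`roc_auc`, `subgroup_auc`) are hypothetical, and the labels, scores, and stages are toy values invented purely for demonstration.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 = LNM-positive, 0 = LNM-negative.
    scores: predicted probability of LNM from any model.
    Tied scores count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def subgroup_auc(labels, scores, groups):
    """Compute the AUC separately within each subgroup (e.g., clinical T stage)."""
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)
    return {g: roc_auc(ys, ss) for g, (ys, ss) in by_group.items()}


# Toy example only (assumed values, not patient data from this study).
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.7, 0.8, 0.1, 0.4, 0.5]
stages = ["T3", "T3", "T3", "T3", "T2", "T2", "T2", "T2"]

overall = roc_auc(labels, scores)
per_stage = subgroup_auc(labels, scores, stages)
print(f"overall AUC: {overall:.3f}")
for stage, auc in sorted(per_stage.items()):
    print(f"{stage} AUC: {auc:.3f}")
```

Because each subgroup contains fewer patients than the full cohort, subgroup AUCs are less stable than the overall estimate, so differences between subgroups should be interpreted with caution in small datasets.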

This study has several limitations that should be addressed. First, our study was retrospective in nature, and prospective studies are needed. It also relied on a single-center cohort with a limited sample size, and the findings therefore require external validation. Second, the mechanisms by which the three candidate genes influence ESCC remain unclear, particularly in the case of CYP4B1, for which there is a paucity of research regarding its correlation with ESCC. This lack of mechanistic detail calls for further studies to elucidate their roles. Finally, our AI model did not outperform the traditional logistic regression model, potentially due to the small sample size of the dataset. As such, validation with larger datasets is warranted, and algorithmic refinement is also needed.


Conclusions

This study is the first to construct a model based on machine learning and transcriptome analysis for predicting LNM in patients with ESCC. Based on RNA-seq analysis and a random forest machine learning algorithm, we screened and identified three candidate genes (CUX1, CYP4B1, and SIM2), which are closely associated with the prognosis of patients with ESCC. The AI model constructed herein can serve as a predictive tool for preoperative staging and for determining whether patients require neoadjuvant therapy, which may help optimize the benefits for patients with ESCC.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-662/rc

Data Sharing Statement: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-662/dss

Peer Review File: Available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-662/prf

Funding: This work was supported by the 2023 Medical Scientific Research Foundation of Guangdong Province, China (No. A2023483).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-662/coif). I.C.O. serves as an unpaid editorial board member of Journal of Thoracic Disease from February 2025 to January 2027. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments, and the procedures involving human participants were approved by the Medical Ethics Committee of the Gaozhou People’s Hospital Affiliated to Guangdong Medical University (No. GYLLYJ-2023216). Written informed consent was provided by all patients.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Bray F, Laversanne M, Sung H, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2024;74:229-63.
  2. Yang H, Wang F, Hallemeier CL, et al. Oesophageal cancer. Lancet 2024;404:1991-2005.
  3. Yin J, Yuan J, Li Y, et al. Neoadjuvant adebrelimab in locally advanced resectable esophageal squamous cell carcinoma: a phase 1b trial. Nat Med 2023;29:2068-78.
  4. Ji X, Cai J, Chen Y, et al. Lymphatic spreading and lymphadenectomy for esophageal carcinoma. World J Gastrointest Surg 2016;8:90-4.
  5. Nguyen AT, Pham VH, Tran MT, et al. Lymph node metastases status in esophageal squamous cell carcinoma following neoadjuvant chemoradiotherapy: a single-center cross-sectional study. Transl Gastroenterol Hepatol 2025;10:8.
  6. Hsu WH, Hsu PK, Wang SJ, et al. Positron emission tomography-computed tomography in predicting locoregional invasion in esophageal squamous cell carcinoma. Ann Thorac Surg 2009;87:1564-8.
  7. Peyre CG, Hagen JA, DeMeester SR, et al. The number of lymph nodes removed predicts survival in esophageal cancer: an international study on the impact of extent of surgical resection. Ann Surg 2008;248:549-56.
  8. Yang H, Liu H, Chen Y, et al. Neoadjuvant Chemoradiotherapy Followed by Surgery Versus Surgery Alone for Locally Advanced Squamous Cell Carcinoma of the Esophagus (NEOCRTEC5010): A Phase III Multicenter, Randomized, Open-Label Clinical Trial. J Clin Oncol 2018;36:2796-803.
  9. Duan X, Shang X, Yue J, et al. A nomogram to predict lymph node metastasis risk for early esophageal squamous cell carcinoma. BMC Cancer 2021;21:431.
  10. Newman NB, Jethwa KR. Is metastasis-directed local therapy the new standard of care for patients with oligometastasic esophageal squamous cell carcinoma?-a perspective on the ESO-Shanghai 13 Trial. Chin Clin Oncol 2025;14:13.
  11. Zheng H, Tang H, Wang H, et al. Nomogram to predict lymph node metastasis in patients with early oesophageal squamous cell carcinoma. Br J Surg 2018;105:1464-70.
  12. Huang X, Wang Q, Xu W, et al. Machine learning to predict lymph node metastasis in T1 esophageal squamous cell carcinoma: a multicenter study. Int J Surg 2024;110:7852-9.
  13. Liang J, Zhao Z, Xie Y, et al. Identification and validation of LINC02381 as a biomarker associated with lymph node metastasis in esophageal squamous cell carcinoma. Transl Cancer Res 2025;14:613-25.
  14. Liu J, Dai L, Wang Q, et al. Multimodal analysis of cfDNA methylomes for early detecting esophageal squamous cell carcinoma and precancerous lesions. Nat Commun 2024;15:3700.
  15. Xue L, Zhao Z, Wang M, et al. A liquid biopsy signature predicts lymph node metastases in T1 oesophageal squamous cell carcinoma: implications for precision treatment strategy. Br J Cancer 2022;127:2052-9.
  16. Xie Y, Zhang Z, Lai D, et al. Lymph node metastasis-related lncRNA GAS6-AS1 facilitates the progression of esophageal squamous cell carcinoma. J Gastrointest Oncol 2023;14:2293-308.
  17. Wei H, Fang R, Zhang S, et al. Current advances in the functional role of long non-coding RNAs in the oncogenesis and metastasis of esophageal squamous cell carcinoma: a narrative review. Transl Cancer Res 2025;14:2150-67.
  18. Yu J, Hu W, Yao N, et al. Development and validation of a nomogram to predict overall survival of T1 esophageal squamous cell carcinoma patients with lymph node metastasis. Transl Oncol 2021;14:101127.
  19. Bhinder B, Gilvary C, Madhukar NS, et al. Artificial Intelligence in Cancer Research and Precision Medicine. Cancer Discov 2021;11:900-15.
  20. Zhang L, Yu S, Yin X, et al. MiR-942-5p inhibits tumor migration and invasion through targeting CST1 in esophageal squamous cell carcinoma. PLoS One 2023;18:e0277006.
  21. Tang WW, Liu ZH, Yang TX, et al. Upregulation of MAGEA4 correlates with poor prognosis in patients with early stage of esophageal squamous cell carcinoma. Onco Targets Ther 2016;9:4289-93.
  22. Chen H, Yao J, Bao R, et al. Cross-talk of four types of RNA modification writers defines tumor microenvironment and pharmacogenomic landscape in colorectal cancer. Mol Cancer 2021;20:29.
  23. Rachidi M, Lopes C, Charron G, et al. Spatial and temporal localization during embryonic and fetal human development of the transcription factor SIM2 in brain regions altered in Down syndrome. Int J Dev Neurosci 2005;23:475-84.
  24. Su P, Wen S, Zhang Y, et al. Identification of the Key Genes and Pathways in Esophageal Carcinoma. Gastroenterol Res Pract 2016;2016:2968106.
  25. Halvorsen OJ, Rostad K, Øyan AM, et al. Increased expression of SIM2-s protein is a novel marker of aggressive prostate cancer. Clin Cancer Res 2007;13:892-7.
  26. Yan Q, Huang S, Zhou M, et al. SND1-SMARCA5 interaction strengthened by PIM promotes the proliferation, metastasis, and chemoresistance of esophageal squamous cell carcinoma. Int J Biol Macromol 2025;291:139152.
  27. Li Q, Liu XL, Jiang N, et al. A new prognostic model for RHOV, ABCC2, and CYP4B1 to predict the prognosis and association with immune infiltration of lung adenocarcinoma. J Thorac Dis 2023;15:1919-34.
  28. Imaoka S, Yoneda Y, Sugimoto T, et al. CYP4B1 is a possible risk factor for bladder cancer in humans. Biochem Biophys Res Commun 2000;277:776-80.
  29. Hamet P, Tremblay J. Artificial intelligence in medicine. Metabolism 2017;69S:S36-40.
  30. Yu D, Yang J, Wang B, et al. New genetic insights into immunotherapy outcomes in gastric cancer via single-cell RNA sequencing and random forest model. Cancer Immunol Immunother 2024;73:112.
  31. Wang Y, Xing S, Xu YW, et al. Highly sensitive detection platform-based diagnosis of oesophageal squamous cell carcinoma in China: a multicentre, case-control, diagnostic study. Lancet Digit Health 2024;6:e705-17.
  32. Yuan P, Huang ZH, Yang YH, et al. A (18)F-FDG PET/CT-based deep learning-radiomics-clinical model for prediction of cervical lymph node metastasis in esophageal squamous cell carcinoma. Cancer Imaging 2024;24:153.
  33. Mohapatra S, Al Ghamdi SS, Charilaou P, et al. Predictors for lymph node metastasis and survival of patients with T1b esophageal adenocarcinoma treated with surgery and endoscopic therapy: an analysis of the Surveillance, Epidemiology, and End Results database. Gastrointest Endosc 2024;100:849-56.
  34. Weksler B, Kennedy KF, Sullivan JL. Using the National Cancer Database to create a scoring system that identifies patients with early-stage esophageal cancer at risk for nodal metastases. J Thorac Cardiovasc Surg 2017;154:1787-93.
  35. Ichimasa K, Foppa C, Kudo SE, et al. Artificial Intelligence to Predict the Risk of Lymph Node Metastasis in T2 Colorectal Cancer. Ann Surg 2024;280:850-7.
Cite this article as: Zhao Z, Xie Y, Lai D, Liang J, Okereke IC, Lin W. Transcriptome analysis and artificial intelligence for predicting lymph node metastasis of esophageal squamous cell carcinoma. J Thorac Dis 2025;17(5):3283-3296. doi: 10.21037/jtd-2025-662
