External validation of a prediction model for pathologic N2 among patients with a negative mediastinum by positron emission tomography
Introduction
Mediastinal nodal evaluation is the cornerstone of staging patients with non-metastatic non-small cell lung cancer (NSCLC). Practice guidelines recommend routine invasive mediastinal staging for patients with a positive mediastinum by positron emission tomography (PET) (1,2). When the mediastinum is negative by PET, guidelines recommend selective invasive staging based on the presence or absence of other radiographic risk factors for mediastinal nodal disease—for example, tumor size and/or location, mediastinal lymphadenopathy, or fluorodeoxyglucose (FDG) uptake in N1 nodes. Because there are multiple risk factors for unsuspected N2 (3-9), there is an opportunity to use risk-prediction to facilitate better patient selection for invasive diagnostic procedures.
Several groups have developed prediction models to improve patient selection for invasive mediastinal staging (10-14). One group of investigators recently developed and internally validated a prediction model for pathologic N2 (pN2) using six pre-operative risk-factors among North American patients with a negative mediastinum by PET (14). An important limitation of this work and other contemporary studies is that validation occurred at only one institution—specifically the same institution where the model was developed. The goal of this investigation was to evaluate the performance of the previously internally validated prediction model for pN2 at another North American site. A secondary aim was to explore the risk-prediction model’s potential impact on care.
Materials and methods
A retrospective investigation (November 2005-March 2013) was conducted of NSCLC patients with a negative mediastinum by PET who underwent nodal evaluation by invasive staging and/or at the time of pulmonary resection at the University of Washington. The Institutional Review Board approved this study and waived the need for consent for this minimal risk retrospective review (committee EA, approval number 44939). Study inclusion/exclusion criteria matched those of a prior North American study that originally developed and internally validated the prediction model under investigation (14). Patients were eligible for study if they were initially staged by computed tomography (CT) and PET and had no evidence of metastatic disease. Patients were ineligible if they had suspected synchronous, metachronous, or recurrent lung cancer; evidence of T3 (except by size criteria) or T4 tumors by CT; and/or induction therapy without pathologic confirmation of nodal disease. The American Joint Committee on Cancer classification for T-status changed over the study period such that tumors greater than 7 cm were considered T3 rather than T2 after 2010. Because the population in which the prediction model was originally developed included patients regardless of tumor size, we allowed for inclusion of subjects with T3 tumors defined exclusively by size criteria.
The previously published North American prediction model utilized logistic regression to estimate the probability of pN2 based on six risk factors ascertained prior to treatment—tumor size by CT, tumor location by CT (central versus peripheral), extent of nodal disease by CT (lymphadenopathy defined by size greater than 1 cm), maximum standardized uptake value (SUVmax) of the primary tumor, FDG uptake in ipsilateral hilar nodes, and tumor histology if the patient underwent a pretreatment biopsy of the primary lesion (14). Coefficient estimates from this model were used to estimate the probability of pN2 for each patient. Like the original prediction model study (14), radiology and pathology reports were used to ascertain model inputs and the true status of mediastinal lymph nodes—defined by pathologic confirmation of disease in nodal tissue.
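As a minimal sketch of how a logistic regression model converts the six pretreatment risk factors into a probability of pN2, the following Python example may help. The intercept, the coefficient values, the variable names, and the binary encoding of histology are all hypothetical placeholders chosen for illustration; the published coefficient estimates are reported in the original model study (14) and are not reproduced here.

```python
import math

# HYPOTHETICAL values for illustration only; these are not the published
# coefficient estimates, which are reported in reference 14.
INTERCEPT = -4.0
COEFFICIENTS = {
    "tumor_size_cm": 0.10,        # tumor size by CT
    "central_location": 0.80,     # 1 = central, 0 = peripheral
    "ct_lymphadenopathy": 0.60,   # 1 = nodes greater than 1 cm by CT
    "suv_max": 0.05,              # SUVmax of the primary tumor
    "n1_fdg_uptake": 1.20,        # 1 = FDG uptake in ipsilateral hilar nodes
    "adenocarcinoma": 0.50,       # assumed binary encoding of histology
}


def predicted_probability(patient):
    """Apply the logistic function to the linear predictor."""
    linear_predictor = INTERCEPT + sum(
        COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical patient: 2.5 cm peripheral tumor, SUVmax 5.6, no other risk factors.
example_patient = {
    "tumor_size_cm": 2.5,
    "central_location": 0,
    "ct_lymphadenopathy": 0,
    "suv_max": 5.6,
    "n1_fdg_uptake": 0,
    "adenocarcinoma": 0,
}
probability = predicted_probability(example_patient)
print(f"Predicted probability of pN2: {probability:.3f}")
```

With real coefficients substituted, the same calculation would yield the patient-level probability that is later compared against the 8.3% threshold.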
Model performance was evaluated using the same metrics utilized by the internal validation study (14): discrimination, calibration (fit), and accuracy. A c-statistic describes a model’s ability to discriminate between two outcomes. The metric ranges from 0.5 (no different than a coin-toss) to 1.0 (perfect discrimination between two outcomes). Values greater than 0.7 indicate reasonable discriminatory ability, and values greater than 0.8 indicate strong discriminatory ability (15). A non-significant goodness-of-fit test indicates good fit. Visual evaluation of calibration (fit) was also performed by plotting actual versus predicted pN2 rates. In order to describe the model’s accuracy for patient selection in terms of sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), patients must be dichotomized into high- versus low-risk groups based on a threshold probability. The original prediction model used an empirical method called the Youden Index to identify the cut-off that maximizes the sum of sensitivity and specificity (16). A probability of pN2 disease >8.3% defined the high-risk group in the original prediction model study (14), and this threshold was used for the current investigation.
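The two quantities described above, the c-statistic and the Youden Index cut-off, can be sketched in a few lines of Python. This is an illustrative implementation under simple assumptions (tied probabilities count as half-concordant, and a patient is high-risk when the predicted probability is strictly greater than the threshold); the toy probabilities and outcomes are invented and are not study data.

```python
def c_statistic(probabilities, outcomes):
    """Share of (pN2, no-pN2) patient pairs in which the pN2 patient has the
    higher predicted probability; ties count as one half."""
    events = [p for p, y in zip(probabilities, outcomes) if y == 1]
    non_events = [p for p, y in zip(probabilities, outcomes) if y == 0]
    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0
            elif p_event == p_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def youden_threshold(probabilities, outcomes):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1,
    where probability > threshold defines the high-risk group."""
    best_threshold, best_j = None, -1.0
    pairs = list(zip(probabilities, outcomes))
    for t in sorted(set(probabilities)):
        tp = sum(1 for p, y in pairs if y == 1 and p > t)
        fn = sum(1 for p, y in pairs if y == 1 and p <= t)
        tn = sum(1 for p, y in pairs if y == 0 and p <= t)
        fp = sum(1 for p, y in pairs if y == 0 and p > t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Invented toy data: predicted probabilities and pathologic N2 status (1 = pN2).
toy_probabilities = [0.02, 0.04, 0.05, 0.09, 0.12, 0.20, 0.30, 0.45]
toy_outcomes = [0, 0, 0, 0, 1, 0, 1, 1]

auc = c_statistic(toy_probabilities, toy_outcomes)
threshold, j_index = youden_threshold(toy_probabilities, toy_outcomes)
print(f"c-statistic = {auc:.3f}; Youden cut-off = {threshold}; J = {j_index:.2f}")
```

In the present study the cut-off was not re-derived; the 8.3% threshold from the original model study was applied unchanged, which is the appropriate choice for external validation.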
A secondary hypothesis-generating aim of this study was to explore the potential impact of the prediction model on care by comparing the frequency and accuracy of patient selection for invasive staging across varying assumptions (Tables S1-S3). Usual care at the University of Washington historically consisted of an aggressive but guideline-allowable staging strategy (1,2) of invasive mediastinal staging for all patients except for those with a relative contraindication and/or those in whom management would not be expected to change based on mediastinal nodal status. We did not use the prediction model as a part of usual care because it was not available until 2012 and because we considered it investigational without evidence of external validity. The accuracy of patient selection with usual care was calculated using a 2×2 table of patients who actually underwent invasive staging. Minimum practice guideline recommendations for selective staging essentially require all patients to undergo invasive staging except those with a radiographic, peripheral stage IA tumor (1,2). The accuracy of patient selection using the minimum recommended standards for invasive staging was estimated using a 2×2 table after classifying patients as having a peripheral stage IA tumor or not. The accuracy of patient selection using the risk-prediction model was estimated using a 2×2 table after classifying patients as being high- or low-risk using the probability cut-off described earlier.
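The 2×2 table calculation that underlies each of the three selection strategies can be sketched as follows. The function name, the dictionary keys, and the 100-patient cohort are hypothetical and chosen only to illustrate the arithmetic; they do not reproduce any table from this study.

```python
def selection_accuracy(selected, has_pn2):
    """Summarize a 2x2 table of selection for invasive staging (rows)
    against pathologic N2 status (columns).

    Assumes every margin of the table is non-empty; otherwise a division
    by zero would occur.
    """
    pairs = list(zip(selected, has_pn2))
    tp = sum(1 for s, d in pairs if s and d)          # selected, pN2
    fp = sum(1 for s, d in pairs if s and not d)      # selected, no pN2
    fn = sum(1 for s, d in pairs if not s and d)      # not selected, pN2
    tn = sum(1 for s, d in pairs if not s and not d)  # not selected, no pN2
    return {
        "frequency": (tp + fp) / len(pairs),  # share selected for invasive staging
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical cohort of 100 patients: 9 TP, 41 FP, 1 FN, 49 TN.
selected = [True] * 9 + [True] * 41 + [False] * 1 + [False] * 49
has_pn2 = [True] * 9 + [False] * 41 + [True] * 1 + [False] * 49

result = selection_accuracy(selected, has_pn2)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Each strategy differs only in how the `selected` flag is assigned: actual receipt of invasive staging for usual care, anything other than a peripheral stage IA tumor for the minimum guideline standard, and a predicted probability above the cut-off for the prediction model.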
Stata (Special Edition 12.1; StataCorp, College Station, Texas, USA) was used to conduct all statistical analyses. Binary variables were summarized using proportions and 95% exact binomial confidence intervals (CI).
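For readers who wish to reproduce an exact binomial (Clopper-Pearson) confidence interval without statistical software, a minimal Python sketch follows. It is not the routine used in the analysis, which was run in Stata; it solves the two tail-probability equations by bisection and is suitable only for modest sample sizes. The example denominator of 240 is an assumption inferred from the reported 18 patients and 7.5% prevalence.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _bisect(func, target, increasing):
    """Find p in [0, 1] with func(p) == target for a monotone func."""
    low, high = 0.0, 1.0
    for _ in range(100):
        mid = (low + high) / 2.0
        if (func(mid) < target) == increasing:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def exact_binomial_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k events among n patients."""
    if k == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= k | p) = alpha / 2; this tail increases with p.
        lower = _bisect(lambda p: 1.0 - binomial_cdf(k - 1, n, p), alpha / 2.0, True)
    if k == n:
        upper = 1.0
    else:
        # Upper limit: P(X <= k | p) = alpha / 2; this tail decreases with p.
        upper = _bisect(lambda p: binomial_cdf(k, n, p), alpha / 2.0, False)
    return lower, upper


# 18 pN2 events; n = 240 is inferred from the reported 7.5%, not stated directly.
lower, upper = exact_binomial_ci(18, 240)
print(f"18/240: 95% exact CI {lower:.3f} to {upper:.3f}")
```

When all patients in a group have the event (for example, a sensitivity of 100%), the upper limit is 1 and the lower limit reduces to the closed form (alpha/2) raised to the power 1/n.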
Results
Table 1 provides an overall summary of the cohort. Nearly every patient underwent invasive staging with a median of 3 (range, 0-5) mediastinal lymph node stations sampled. Two patients with pN2 identified by invasive staging did not undergo operative management because one had single-station bulky disease and the other opted for definitive chemoradiation therapy. Intra-operatively, a median of 1 (range, 0-8) mediastinal lymph node station was sampled. The overall median number of mediastinal lymph node stations sampled by any means (invasive staging or intra-operatively) was 3 (range, 1-9), with 86% of patients having at least three mediastinal nodal stations sampled. A total of 18 patients had pN2 (7.5%, 95% CI: 4.5-12%)—all with single-station disease. Table 2 shows the distribution of risk factors and the prevalence of pN2 across each risk factor. The median primary tumor size by CT and SUVmax were 2.5 cm and 5.6, respectively. Patients with FDG uptake in N1 nodes had the highest prevalence of pN2 disease (25%).
Table 3 summarizes model performance for the external validation cohort as well as the reported performance of the original development and validation cohorts. Model discrimination was excellent in our external validation cohort and significantly higher than that reported in the internal validation cohort (Figure 1). The model fit the data well across all cohorts (P>0.05). Figure 2 reveals that the model tended to overestimate the risk of pN2 in the external validation cohort. Model sensitivity, specificity, PPV, and NPV were not significantly different across cohorts. A post-hoc analysis was conducted among patients with at least three nodal stations sampled by any means (n=205, pN2 = 7.3%). Model performance was similar to that observed in our primary analysis [c-statistic 0.80 (95% CI: 0.73-0.85), goodness-of-fit test P=0.50, sensitivity 100% (95% CI: 78-100%), specificity 46% (95% CI: 39-54%), PPV 13% (95% CI: 7.3-20%), and NPV 100% (95% CI: 96-100%)]. An additional post-hoc analysis was conducted among patients with at least three nodal stations sampled intraoperatively (n=58, pN2 = 12%). Again, model performance was similar to that observed in our primary analysis except that the PPV was significantly higher [c-statistic 0.78 (95% CI: 0.65-0.87), goodness-of-fit test P=0.62, sensitivity 100% (95% CI: 59-100%), specificity 59% (95% CI: 44-72%), PPV 72% (95% CI: 53-87%), and NPV 100% (95% CI: 88-100%)].
Table 4 summarizes the frequency and accuracy of patient selection for invasive staging across varying assumptions. Usual care—representing an aggressive but guideline allowable invasive staging strategy—was characterized by the highest utilization of invasive procedures and lowest accuracy in terms of patient selection. Had the minimum practice guideline recommended indications for invasive staging been adhered to, use of invasive procedures would have been significantly less, and the specificity and NPV of patient selection would have been significantly higher. Compared to both usual care and the minimum recommended indications for invasive staging, use of the prediction model would have resulted in even less use of invasive diagnostic tests and further improvements in the specificity of patient selection.
Discussion
The clinical application of a prediction model for nodal disease in operable lung cancer patients is to guide the use of invasive staging procedures prior to first treatment. Patients at high-risk for nodal disease will still require invasive staging in order to characterize the extent of disease (i.e., N1, single-station N2, multi-station N2, N3). However, patients at low-risk may proceed directly to resection with intraoperative nodal evaluation. A few prediction models have been developed for mediastinal staging among NSCLC patients (10-14), but none have been externally validated. We report that a previously developed and internally validated prediction model for pN2 among North American patients with a negative mediastinum by PET is externally valid. The potential impact of such a model is a reduction in the use of invasive mediastinal staging procedures and improvements in patient selection for invasive diagnostic tests.
External validation revealed that model performance was at least as good as that observed in the original development and internal validation cohorts (14). Although model discrimination was significantly better in our external validation cohort, superior performance in this metric did not translate into observable improvements in the accuracy of patient selection for invasive staging. One potential explanation is the relatively small number of patients and events in this study. Importantly, however, we found no evidence of performance decrements in terms of discrimination, calibration (fit), or accuracy. The tendency of the model to overestimate risk was also observed in the internal validation study (14). On balance, the performance of the model appeared to be similar across two different centers among patient populations defined by the same inclusion/exclusion criteria.
External validation of risk-prediction models is important because the ascertainment and/or prevalence of risk-factors can vary across patient populations at different centers despite identical inclusion/exclusion criteria. Compared to the population in which the prediction model was originally developed and validated (14), patients in our study tended to have slightly larger tumors radiographically, higher SUVmax, and a higher frequency of lymphadenopathy by CT and FDG-uptake in N1 nodes. Interestingly, despite a higher prevalence of radiographic risk factors for pN2, the overall rate of pN2 did not vary significantly across studies. These findings support conventional wisdom suggesting variation in the interpretation of advanced imaging across centers. Although the reasons underlying such variation are often unknown, in some cases they are expected. For instance, factors that influence SUV measurement include the time from FDG injection to imaging, type of image reconstruction algorithm, reconstruction filter, scan length, and attenuation correction methods (17). The significance of variation in interpreting advanced imaging is that model performance may also vary from one site to another. While it is desirable to mitigate practice variation, it is not always practical to do so. For this reason, it is imperative to externally validate the performance of risk-prediction models across varying settings. The similar performance of the prediction model across two different North American centers provides evidence in support of its generalizability. However, one important limitation of the current analysis is that model performance was evaluated at another high-volume, academic, National Cancer Institute (NCI)-designated comprehensive cancer center. Future studies should evaluate model performance across multiple environments including non-academic centers, non-NCI-designated cancer centers, low-volume hospitals, and integrated health systems.
Our exploratory aim suggests that use of a risk-prediction model for mediastinal staging may lead to a decreased need for invasive diagnostic tests through better patient selection. To the extent that diagnostic accuracy, appropriate treatment selection, and patient outcomes are not adversely impacted by the use of a risk-prediction model, a reduction in the number of invasive diagnostic tests will increase the value of thoracic oncologic care. Certainly, fewer procedures will translate into less exposure to procedure-related risks and lower costs. The efficiency of care may also improve by omitting one step (invasive staging) in a complex diagnostic algorithm for working up suspected or confirmed NSCLC (1,2). Another potential benefit of risk-prediction is reducing unnecessary provider-level variation in the use of invasive procedures. Practice variation in the use of invasive staging modalities has not yet been described at the population level, but it is strongly suspected based on clinical experience. Importantly, the use of a risk-prediction model does not eliminate patient-level variation in the use of invasive staging procedures arising from differences in individual-level risk. Risk-prediction provides an opportunity to standardize provider- or institutional-level approaches to lung cancer staging while ensuring the delivery of personalized cancer care.
This study has several important limitations. Our approach to defining true pathologic nodal status is antiquated, even though it was based on a conventional criterion appropriate for the period of time under study. The traditional gold standard definition for mediastinal nodal disease is pathologic confirmation of disease in nodal tissue. However, this definition does not take into account the thoroughness of mediastinal nodal evaluation (18). A secondary analysis of the ACOSOG Z0030 trial revealed that 99% of patients had at least three nodal stations sampled (19). Subsequently, contemporary practice guidelines and quality improvement initiatives regard assessment of at least three nodal stations to be the standard (2,20). One potential consequence of lymph node sampling is that the true prevalence of nodal disease is underestimated, which may in turn falsely elevate sensitivity and falsely decrease specificity. A single institution study evaluated the prevalence of pN2 when usual care consists of routine lymphadenectomy and reported a prevalence of pN2 only slightly higher (10%) than our rate of pN2 (21). Furthermore, a post-hoc analysis restricted to patients in our study who had at least three nodal stations sampled (by any means or intraoperatively) revealed model performance similar to that observed in the internal validation study. We found no evidence to suggest that the pragmatic nature of our retrospective study has substantially biased results pertaining to model performance. Future validation studies should ideally define the gold standard for determining nodal disease by both pathologic confirmation of disease and adequate extent of mediastinal nodal assessment.
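The concern about incomplete nodal sampling raised in this paragraph can be illustrated with a small worked example in Python. All counts are hypothetical and were chosen only to show the direction of the bias: patients with true but undetected pN2 are recorded as node-negative, so high-risk patients move from true positive to false positive and low-risk patients move from false negative to true negative.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from 2x2 table counts."""
    return tp / (tp + fn), tn / (tn + fp)


# HYPOTHETICAL true classification of 100 patients (12 with pN2).
true_counts = {"tp": 10, "fn": 2, "fp": 40, "tn": 48}

# Assumed numbers of pN2 patients missed because of incomplete sampling.
missed_high_risk = 3  # recorded as false positives instead of true positives
missed_low_risk = 2   # recorded as true negatives instead of false negatives

observed_counts = {
    "tp": true_counts["tp"] - missed_high_risk,
    "fn": true_counts["fn"] - missed_low_risk,
    "fp": true_counts["fp"] + missed_high_risk,
    "tn": true_counts["tn"] + missed_low_risk,
}

true_sens, true_spec = sensitivity_specificity(**true_counts)
obs_sens, obs_spec = sensitivity_specificity(**observed_counts)

print(f"Sensitivity: true {true_sens:.3f}, observed {obs_sens:.3f}")
print(f"Specificity: true {true_spec:.3f}, observed {obs_spec:.3f}")
```

In this sketch the observed sensitivity rises and the observed specificity falls relative to their true values. The fall in specificity is not guaranteed in general: it occurs when the share of missed pN2 patients who were classified as high-risk exceeds the true false-positive rate.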
One potential criticism of our study is a missed opportunity to evaluate the performance of other previously published prediction models for pN2. None of the four prior models/studies used information from PET to define the population under investigation and/or as model inputs. One North American study evaluated patients prior to the availability of PET (10). Two studies from China and one from Japan alluded to the selective use of PET in their countries because of high costs and/or delayed adoption (11-13). Because PET is the standard of care in North America, we only considered validating the prediction model developed among a population that routinely underwent PET. As a consequence, the prediction model under study may only be generalizable to patients undergoing routine PET. Another potential criticism of this study is that the availability of minimally-invasive endosonography obviates the need for risk-prediction. However, practice guidelines recommending the use of endosonography for invasive mediastinal staging emphasize the importance of available equipment and the skill and experience of the operator, which may not yet exist in all practice settings (1,2). Furthermore, the ASTER trial revealed that up to 53% of patients undergoing first-line endosonography will require surgical staging to definitively rule out mediastinal disease (22). Finally, endosonography is not without risk (23). A risk-prediction model complements new technology in reducing procedure-related risks for lung cancer patients.
In conclusion, a previously developed and internally validated prediction model for pN2 among patients with a negative mediastinum by PET is externally valid. The potential benefit of using risk-prediction is a reduction in the overall number of invasive diagnostic procedures achieved through better patient selection.
Acknowledgements
Authors’ contributions: F Farjah contributed substantially to the concept and study design, acquisition of data, data analysis and interpretation, and writing and revising the manuscript. JP Manning contributed substantially to the study design, acquisition of data, and revising the manuscript. LM Backhus, TK Varghese, AM Cheng, MS Mulligan, and DE Wood contributed substantially to the study design, interpretation of the data, and revision of the manuscript. All authors (F Farjah, LM Backhus, TK Varghese, JP Manning, AM Cheng, MS Mulligan, and DE Wood) have provided final approval of the version to be published and have agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Disclosure: Dr. Backhus has received payment as a consultant for Disney and Engine Room Production Company. Dr. Mulligan has received payment as a consultant for Covidien. Dr. Wood has received payment as a consultant for Spiration as well as grant support for Spiration-related research. Dr. Farjah is a Cancer Research Network Scholar and the recipient of a Cancer Research Network Pilot Grant (Grant Number 1U24 CA171524 from the NCI).
References
- Silvestri GA, Gonzalez AV, Jantz MA, et al. Methods for staging non-small cell lung cancer: Diagnosis and management of lung cancer, 3rd ed: American College of Chest Physicians evidence-based clinical practice guidelines. Chest 2013;143:e211S-50S.
- National Comprehensive Cancer Network. Clinical practice guidelines in oncology–v.2.2013: Non-small cell lung cancer. NCCN Guidelines 2013.
- Al-Sarraf N, Aziz R, Gately K, et al. Pattern and predictors of occult mediastinal lymph node involvement in non-small cell lung cancer patients with negative mediastinal uptake on positron emission tomography. Eur J Cardiothorac Surg 2008;33:104-9.
- Cerfolio RJ, Bryant AS. Survival of patients with unsuspected N2 (stage IIIA) nonsmall-cell lung cancer. Ann Thorac Surg 2008;86:362-66; discussion 366-7.
- Defranchi SA, Cassivi SD, Nichols FC, et al. N2 disease in T1 non-small cell lung cancer. Ann Thorac Surg 2009;88:924-8.
- Kanzaki R, Higashiyama M, Fujiwara A, et al. Occult mediastinal lymph node metastasis in NSCLC patients diagnosed as clinical N0-1 by preoperative integrated FDG-PET/CT and CT: Risk factors, pattern, and histopathological study. Lung Cancer 2011;71:333-7.
- Lee PC, Port JL, Korst RJ, et al. Risk factors for occult mediastinal metastases in clinical stage I non-small cell lung cancer. Ann Thorac Surg 2007;84:177-81.
- Park HK, Jeon K, Koh WJ, et al. Occult nodal metastasis in patients with non-small cell lung cancer at clinical stage IA by PET/CT. Respirology 2010;15:1179-84.
- Trister AD, Pryma DA, Xanthopoulos E, et al. Prognostic value of primary tumor FDG uptake for occult mediastinal lymph node involvement in clinically N2/N3 node-negative non-small cell lung cancer. Am J Clin Oncol 2014;37:135-9.
- Shafazand S, Gould MK. A clinical prediction rule to estimate the probability of mediastinal metastasis in patients with non-small cell lung cancer. J Thorac Oncol 2006;1:953-9.
- Koike T, Koike T, Yamato Y, et al. Predictive risk factors for mediastinal lymph node metastasis in clinical stage IA non-small-cell lung cancer patients. J Thorac Oncol 2012;7:1246-51.
- Zhang Y, Sun Y, Xiang J, et al. A prediction model for N2 disease in T1 non-small cell lung cancer. J Thorac Cardiovasc Surg 2012;144:1360-4.
- Chen K, Yang F, Jiang G, et al. Development and validation of a clinical prediction model for N2 lymph node metastasis in non-small cell lung cancer. Ann Thorac Surg 2013;96:1761-8.
- Farjah F, Lou F, Sima C, et al. A prediction model for pathologic N2 disease in lung cancer patients with a negative mediastinum by positron emission tomography. J Thorac Oncol 2013;8:1170-80.
- Hosmer DW, Lemeshow S. Applied logistic regression, second edition. New York: John Wiley and Sons; 2000.
- Youden WJ. Index for rating diagnostic tests. Cancer 1950;3:32-5.
- Toney LK, Vesselle HJ. Neural networks for nodal staging of non-small cell lung cancer with FDG PET and CT: importance of combining uptake values and sizes of nodes and primary tumor. Radiology 2014;270:91-8.
- Detterbeck F, Puchalski J, Rubinowitz A, et al. Classification of the thoroughness of mediastinal staging of lung cancer. Chest 2010;137:436-42.
- Darling GE, Allen MS, Decker PA, et al. Randomized trial of mediastinal lymph node sampling versus complete lymphadenectomy during pulmonary resection in the patient with N0 or N1 (less than hilar) non-small cell carcinoma: results of the American College of Surgeons Oncology Group Z0030 Trial. J Thorac Cardiovasc Surg 2011;141:662-70.
- Katlic MR, Facktor MA, Berry SA, et al. ProvenCare lung cancer: a multi-institutional improvement collaborative. CA Cancer J Clin 2011;61:382-96.
- Cerfolio RJ, Bryant AS, Minnich DJ. Complete thoracic mediastinal lymphadenectomy leads to a higher rate of pathologically proven N2 disease in patients with non-small cell lung cancer. Ann Thorac Surg 2012;94:902-6.
- Annema JT, van Meerbeeck JP, Rintoul RC, et al. Mediastinoscopy vs endosonography for mediastinal nodal staging of lung cancer: a randomized trial. JAMA 2010;304:2245-52.
- von Bartheld MB, van Breda A, Annema JT. Complication rate of endosonography (endobronchial and endoscopic ultrasound): a systematic review. Respiration 2014;87:343-51.