Original Article
Feasibility of extracting key elements from thoracic surgical operative notes using foundational large language models
Vamshi Mugu1, Vera Sorin1, Alex Chan1, Casey Briggs2, Brendan Carr3, Mike Olson1, John Schupbach2, John Zietlow4, Sahar Saddoughi2, John Schmitz1, Ashish Khandelwal1
1Department of Radiology, Mayo Clinic, Rochester, MN, USA;
2Department of Thoracic Surgery, Mayo Clinic, Rochester, MN, USA;
3Department of Emergency Medicine, Mayo Clinic, Rochester, MN, USA;
4Department of Trauma, Critical Care, and General Surgery, Mayo Clinic, Rochester, MN, USA
Contributions: (I) Conception and design: V Mugu, V Sorin; (II) Administrative support: J Schmitz, A Khandelwal; (III) Provision of study materials or patients: V Mugu; (IV) Collection and assembly of data: V Mugu, V Sorin, C Briggs; (V) Data analysis and interpretation: V Mugu, V Sorin; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.
Correspondence to: Vamshi Mugu, MD, MS. Department of Radiology, Mayo Clinic, 200 First St. SW, Rochester, MN 55905, USA. Email: mugu.vamshi@mayo.edu.
Background: Operative notes contain key information vital to subsequent care of the patient; however, manual extraction of this information can be time-consuming. Large language models (LLMs) offer promise in this regard. The aim of this study was to evaluate the feasibility of using LLMs to automatically extract key elements from thoracic surgical operative notes.
Methods: We instructed three LLMs, Gemini, ChatGPT, and LLaMA, to extract key information from 116 thoracic surgery operative notes. A blinded thoracic surgeon assessed accuracy, while two blinded physicians (an emergency physician and a radiologist) recorded the time needed to extract the key elements (“approach”, “purpose of the procedure”, “complications”, and “devices left”) manually and from the LLM outputs.
Results: Gemini demonstrated the highest overall accuracy (95.3%), followed by ChatGPT (93.1%) and LLaMA (87.7%). The mean time to extract key elements was significantly lower when using LLMs: 22.1 vs. 4.1 seconds for the emergency physician and 30 vs. 3 seconds for the radiologist (P<0.001).
Conclusions: With appropriate human supervision, LLMs can be used to automatically extract key information from thoracic surgical operative notes. Using LLMs for this task can save clinician time significantly, without compromising accuracy. Due to the availability of several LLM choices, including open- and closed-source, comparative evaluation is beneficial to address specific needs and constraints.
Keywords: Large language models (LLMs); thoracic surgery; operative notes; generative artificial intelligence (generative AI)
Submitted Jun 30, 2025. Accepted for publication Sep 05, 2025. Published online Nov 26, 2025.
doi: 10.21037/jtd-2025-1318
Highlight box
Key findings
• Large language models (LLMs) can be used to extract key information from thoracic surgical operative notes.
What is known and what is new?
• The vital importance of operative notes to subsequent care of the patient is well known; however, the extraction of such information can be time-consuming.
• With their ability to process natural language, LLMs offer a time-saving mechanism in this regard.
What is the implication, and what should change now?
• With careful testing and appropriate human supervision, LLMs can be used to process operative notes, as demonstrated using thoracic surgical operative notes in this project.
Introduction
Thoracic surgical operative notes provide information on intraoperative findings and events and serve as a common language between surgical and non-surgical clinicians by documenting the surgical approach, implanted devices, and any complications encountered. These notes inform clinicians who were not present during surgery (1,2). This common language enables clinicians to provide optimal care to post-surgical patients, particularly in the event of unexpected postoperative clinical presentation. However, extracting information from lengthy reports can be tedious, time-consuming, and prone to errors, potentially leading to delays in clinical decision-making (3,4).
Large language models (LLMs) are the current state of the art in natural language processing (NLP) (5). These models have shown promise in various healthcare applications, including clinical text summarization (6), classification (7), and decision support (8). LLMs have also been shown to improve patient-clinician communication through simplified terminology (9) and by providing empathetic responses (10). Despite this potential, LLMs face limitations, including inaccuracies in outputs, biases, ethical considerations, and privacy concerns (11,12).
The aim of this study was to evaluate the potential of foundational (out-of-the-box) LLMs in summarizing thoracic surgical operative notes by extracting clinically relevant key elements, and to assess the potential impact of this approach on clinical workflows in terms of clinician time saved. We present this article in accordance with the STARD reporting checklist (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1318/rc).
Methods
This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the institutional review board of the Mayo Clinic (IRB# 23-009099) and individual consent for this retrospective analysis was waived. We retrospectively extracted 126 thoracic surgical operative notes for patients over the age of 18 years who underwent thoracic procedures between 07/01/2023 and 11/30/2023 at our institution. We excluded notes that contained any protected health information within the clinical free-text or physician-specific identifiers, leaving 116 notes for analysis. Exclusions are depicted in Figure 1.
Figure 1 Inclusion and exclusion criteria. One hundred and sixteen notes remained for analysis after exclusions.
Each of the participating physicians was asked to identify a small set of key elements (approximately 5) in a surgical note based on clinical relevance. Of the received responses, four elements were chosen by majority consensus: “approach”, “purpose of the procedure”, “complications”, and “devices left”. “Approach” referred to the surgical approach to the procedure (such as gastroscopy, bronchoscopy, mediastinoscopy, and open). “Purpose of the procedure” detailed the primary intent, such as biopsy, resection, myotomy, dilation, or stent check. “Complications” included events such as hemorrhage. “Devices left” identified items such as chest tubes and endotracheal tubes.
Three LLMs, Gemini (1.5-Flash-002 model, Google), ChatGPT (version 4, OpenAI), and LLaMA (version 3.2-vision-11b, Meta) were used as foundational (out-of-the-box) models. Each note was processed in a separate instance, ensuring that the analysis of one note did not influence the processing of others. We used default settings for temperature, top-P, and top-K hyperparameters. The prompt that was used consistently for all models is detailed in Appendix 1. An example with the original surgical operative note and a sample output has been provided in Appendix 2.
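The per-note processing described above can be sketched as a simple loop in which each note is submitted in its own independent call. The sketch below is illustrative only: the `call_llm` function is a hypothetical stand-in for any of the three model APIs, and the prompt text is a placeholder, not the study's actual prompt (which is detailed in Appendix 1).

```python
# Sketch of independent, per-note key-element extraction.
# `call_llm` is a hypothetical stub standing in for a model API
# (Gemini, ChatGPT, or LLaMA); each note is sent in a fresh call
# so the analysis of one note cannot influence another.
import json

KEY_ELEMENTS = ["approach", "purpose of the procedure",
                "complications", "devices left"]

PROMPT = ("Extract the following key elements from the operative note "
          "below and return them as JSON with exactly these keys: "
          + ", ".join(KEY_ELEMENTS) + ".\n\nNote:\n")

def call_llm(prompt: str) -> str:
    # Hypothetical stub; a real deployment would call a model API here
    # with default temperature, top-P, and top-K settings.
    return json.dumps({k: "not stated" for k in KEY_ELEMENTS})

def extract_key_elements(notes: list[str]) -> list[dict]:
    results = []
    for note in notes:  # one independent call per note
        raw = call_llm(PROMPT + note)
        results.append(json.loads(raw))
    return results

out = extract_key_elements(["Example operative note text."])
print(sorted(out[0].keys()))
```

In a real deployment the stub would be replaced by the vendor's client library, with the same one-call-per-note structure preserved.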
A thoracic surgeon was asked to rate the accuracy of the LLM output, i.e., accurate extraction of key elements from the surgical operative notes. The surgeon was blinded to the specific model. The accuracy of each key element for each LLM and operative note was recorded as a “yes” or “no”, designating whether the surgeon found the key element to be accurate or not.
Two clinicians (a radiologist and an emergency physician) were provided with the list of 116 operative notes and were asked to record the time (in seconds) needed to manually extract the key elements from each note. These clinicians were then asked to record the time (in seconds) needed to extract the same key elements from the output of the three LLMs for each note. The clinicians were also blinded to the LLM models. The two tasks (extraction of key elements manually and by using LLM outputs) were also temporally separated to avoid cognitive (recall) bias.
Statistical analysis
All analyses were conducted using Python (version 3.11). Statistical computations were performed with SciPy (version 1.15.2, Enthought, Inc., Austin, TX, USA). Matplotlib (version 3.9.4) was used to generate illustrations. McNemar’s test (the matched-pairs chi-squared test) was used to compare accuracies across the LLMs and between LLMs and clinicians. A paired t-test was used to identify differences in the time needed to extract key elements by each of the two clinicians (radiologist and emergency physician). A P value of less than 0.05 was considered statistically significant. While ChatGPT (version 4) was used to generate some Python code for statistical analysis, artificial intelligence (AI) tools were not used to generate any specific content of the manuscript.
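McNemar's test is appropriate here because the same notes are rated for each model, making the per-note accuracy judgments paired: only the discordant pairs (notes where one model is correct and the other is wrong) carry information. The sketch below illustrates the exact (binomial) form of the test in plain Python with hypothetical counts; it is not the study's code, which used SciPy.

```python
# Exact two-sided McNemar test for paired accuracy comparisons.
# b = count of notes where model A was correct and model B wrong;
# c = the reverse. Counts below are hypothetical, not study data.
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided p-value from the discordant-pair counts,
    testing whether the two models' error rates differ."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # two-sided exact binomial test with null probability 0.5
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical example: model A correct where B was wrong on 20
# notes, the reverse on 5 notes.
p = mcnemar_exact_p(20, 5)
print(round(p, 4))
```

With large discordant counts the chi-squared approximation used by statistical packages gives nearly identical results.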
Results
Of the included 116 thoracic surgical operative notes, 61/116 (52.6%) were for male and 48/116 (41.4%) were for female patients, while sex was not specified for 7/116 patients (6%). The median age was 66 years (range, 24–84 years).
Overall accuracy in identifying key elements was highest for Gemini (95.3%), followed by ChatGPT (93.1%) and LLaMA (87.7%). The difference in accuracy between Gemini and ChatGPT was not statistically significant (P=0.055). Both Gemini and ChatGPT outperformed LLaMA (P<0.001 and P=0.003, respectively), with differences in proportions of 41.5% between Gemini and LLaMA (95% CI: 38–45%) and 40.4% between ChatGPT and LLaMA (95% CI: 36.8–44%). Overall accuracy is plotted in Figure 2.
Figure 2 Overall accuracy. Gemini had the highest overall accuracy followed by ChatGPT and LLaMA. Both Gemini and ChatGPT outperformed LLaMA (P<0.001 and P=0.003, respectively).
The accuracy in extracting individual key elements varied between the models. When extracting the “approach”, all three models demonstrated high and comparable accuracy (Gemini: 97.4%, ChatGPT: 96.6%, LLaMA: 92.2%), with no significant differences observed. When extracting “purpose of the procedure”, LLaMA had the highest accuracy (99.1%), outperforming ChatGPT (92.2%, P=0.03; difference in proportions: 45.7%, 95% CI: 38.4–53%) but comparable to Gemini (95.7%). When extracting “complications”, Gemini achieved the highest accuracy (98.3%), followed by ChatGPT (96.6%) and LLaMA (92.2%), but the differences were not statistically significant. When extracting “devices left”, both Gemini (89.7%) and ChatGPT (87.9%) significantly outperformed LLaMA (61.2%, P<0.001), with differences in proportions of 25.4% between Gemini and LLaMA (95% CI: 18.5–32.4%) and 24.6% between ChatGPT and LLaMA (95% CI: 17.5–31.7%). Accuracies by key element are plotted in Figure 3.
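The confidence intervals reported above are intervals for a difference of two proportions. A minimal sketch of the standard Wald-style interval is shown below, using hypothetical counts rather than the study's data, to make the computation concrete.

```python
# 95% Wald confidence interval for a difference of two proportions.
# x1/n1 and x2/n2 are hypothetical accuracy counts, not study data.
from math import sqrt

def diff_proportion_ci(x1: int, n1: int, x2: int, n2: int,
                       z: float = 1.96):
    """Return the difference p1 - p2 and its 95% Wald interval."""
    p1, p2 = x1 / n1, x2 / n2
    d = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d, (d - z * se, d + z * se)

# Hypothetical: 95/100 correct vs. 80/100 correct.
d, (lo, hi) = diff_proportion_ci(95, 100, 80, 100)
print(f"diff={d:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Note that McNemar-style paired intervals, which condition on the same notes being rated for both models, are generally narrower than this unpaired sketch.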
Figure 3 Accuracy of key elements. Accuracy varied by key element and model, ranging from 61.2% to 99.1%. (A) Approach. (B) Purpose. (C) Complications. (D) Devices left. LLM, large language model.
The mean time to extract key elements from the operative reports, as reported by the emergency physician, was 22.1 seconds manually compared to 4.1 seconds using LLMs (P<0.001). The radiologist reported a mean time of 30 seconds manually vs. 3 seconds with LLMs (P<0.001). These are depicted in Figure 4.
Figure 4 Time to extract key elements. Both the emergency physician and the radiologist noted that significantly less time was needed to extract key elements from LLM outputs compared to manual extraction (P<0.001). (A) Approach. (B) Purpose. (C) Complications. (D) Devices left. LLM, large language model.
Discussion
Surgical operative notes contain critical information about events that transpired during the procedure and serve as a mode of communication for providers who care for postsurgical patients across all clinical care settings, as well as in situations where patients are unable to describe these details themselves (1). These notes provide a detailed description of the main surgical procedure(s), additional procedures performed (e.g., biopsy), critical anatomy, complications, and information needed for future operations. Furthermore, operative notes may also contain post-operative treatment plans, such as the type and duration of antibiotics or anticoagulation medications prescribed. Efficient availability of these details empowers clinicians to provide continuity and coordination of post-operative care (3).
When a post-surgical patient presents to the emergency department, various clinicians across specialties collaborate to provide optimal care, including members from the emergency department, radiology, and other surgical specialties. During this investigative phase, the surgical operative notes are often reviewed to guide appropriate ordering of imaging studies and for directing immediate patient disposition. In addition to being aware of the normal postoperative findings, important questions such as “What surgery was performed and was it done in a laparoscopic fashion or open?”, “Were there any complications that occurred during the procedure?”, “Is the amount of pneumoperitoneum appropriate for the procedure performed, and if not, what else can be the cause?”, “Is the altered surgical anatomy expected for the surgery stated?”, “Were there retained hemostatic agents that may mimic abscess or other foreign object?”, and “Is the tube or drain in the intended location?”, remain in constant flux and operative notes are often pivotal in arriving at answers. Planning any effective intervention after the investigative phase, such as draining a post-surgical abscess, is bound to benefit from the information embedded within operative notes (4). For example, certain surgical reconstructions should be avoided, such as tissue flaps that may have a vulnerable vascular supply or implanted mesh that can be traversed by smaller needles but not by typical drainage catheters. The surgical operative note is crucial in this situation since it differentiates potential mimics such as hemostatic agents and sealants from normal anatomy (13).
Due to the complexity, length, and time-intensive nature of clinical note processing, several researchers have attempted to use LLMs to automate this task (14). Lee et al., for example, assessed whether a language model could automatically extract relevant information from free-text operative reports in cardiac surgery registries (15). Bombieri et al. pretrained a model to capture surgical vocabulary from books and academic papers (16). Van Veen et al. used LLMs to summarize radiology reports, patient questions, progress notes, and doctor-patient communication and found that LLMs were non-inferior to medical experts in these tasks (17,18). In spite of these reported successes, several studies advocate cautious use of LLMs in medicine. Goodman et al., for example, found that AI-generated clinical summaries lacked the accuracy required for medical use, primarily due to the probabilistic nature of the underlying mathematics (the same model could generate different answers to the same question) and the propensity to generate plausible but non-factual information, a behavior termed hallucination (19). The central role of a human (physician), often termed human-in-the-loop, is emphasized by researchers such as Hartman et al. in ensuring patient safety when using LLMs (20,21).
To our knowledge, a collaborative effort among emergency physicians, radiologists, general surgeons, and thoracic surgeons in exploring the capabilities of LLMs to extract clinically significant key elements from surgical operative notes is lacking in the literature. The impetus for this study stems from real-life interactions and workflows among emergency physicians, thoracic surgeons, general surgeons, and radiologists that begin when acute postsurgical patients present to the emergency department. A tool such as this could be incorporated as a supplement to the electronic medical record and has the potential to enable timely and standardized extraction of useful information from surgical notes, thereby increasing the efficiency of the radiology clinical workflow and the accuracy and relevance of its output. Given the recent success of applying LLMs to text-based healthcare data, our proof-of-concept study sought to evaluate LLMs for extracting key elements from thoracic surgical notes using a carefully crafted prompt based on the important questions posed above. These elements, determined by consensus of the participating physicians, guide initial clinical evaluation in the emergency department, investigations with radiologic studies, and potential downstream radiologic interventions: “approach”, “purpose of the procedure”, “complications”, and “devices left”. The accuracy of each LLM output on all operative notes was validated by a thoracic surgeon and ranged from 87.7% to 95.3%. Additionally, a blinded emergency physician and a radiologist extracted the key elements from the LLM outputs in significantly less time than by manual extraction from the surgical operative notes (P<0.001).
To mitigate some of the concerns in using LLMs for tasks such as key element extraction from operative notes, the authors propose direct human oversight of these LLMs and deployment in controlled environments, such as a firewalled open-source architecture, with rigorous standards that preserve patient well-being. While this design was largely adhered to during the feasibility testing of the current project, the details of the design itself are beyond the scope of the current undertaking. We hope that our successful feasibility study paves the way for future research expanding LLM use, albeit cautiously, to other surgical arenas spanning research, education, and practice.
Our study is not without limitations. First, as a retrospective single-institutional study using a focused dataset of thoracic procedures, our findings have limited immediate generalizability. Second, we used only one prompt across all three LLMs. While a single prompt introduces simplicity and enables unbiased comparison, we did not explore the reliability and variability of LLM outputs under additional prompts with controlled word changes (22). Third, our study was conducted without optimization of LLM parameters. Fourth, the small set of extracted features does not cover the entire spectrum of important information within surgical operative reports; other details, such as the date of surgery, can be crucial during clinical evaluation. Such information, while easy to extract, was deliberately excluded in this evaluation phase to ensure Health Insurance Portability and Accountability Act (HIPAA) compliance, but could easily be incorporated in future phases. Fifth, our assessment was limited to “out of the box” LLMs for establishing feasibility; we envision increased efficiency and accuracy when techniques such as pre-training and fine-tuning are employed. Lastly, it must be strongly emphasized that while the current study employed a controlled environment for feasibility analysis with careful manual removal of protected health information, any use of technology in an environment as sensitive as healthcare should be carefully evaluated to protect such information and disseminate it only as needed. Notably, several LLM-based tools have been discussed to address such needs (23).
Conclusions
Our feasibility study demonstrates that several LLMs show promising accuracy in extracting key information from thoracic surgical operative notes. We also demonstrated that LLMs can save significant clinician time used in extracting this information. However, LLMs have limitations, and human supervision remains necessary to ensure accuracy and patient safety.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jtd.amegroups.com/article/view/10.21037/jtd-2025-1318/coif). B.C. reports a know-how agreement with Quai.MD. The other authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the institutional review board of the Mayo Clinic (IRB# 23-009099) and individual consent for this retrospective analysis was waived.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
Ebbers T, Kool RB, Smeele LE, et al. The Impact of Structured and Standardized Documentation on Documentation Quality; a Multicenter, Retrospective Study. J Med Syst 2022;46:46. [Crossref] [PubMed]
Hassan RE, Akbar I, Khan AU, et al. A Clinical Audit of Operation Notes Documentation and the Impact of Introducing an Improved Proforma: An Audit Cycle. Cureus 2023;15:e50281. [Crossref] [PubMed]
Toru HK, Aizaz M, Orakzai AA, et al. Improving the Quality of General Surgical Operation Notes According to the Royal College of Surgeons (RCS) Guidelines: A Closed-Loop Audit. Cureus 2023;15:e48147. [Crossref] [PubMed]
Oladeji EO, Singh S, Kastos K. Improving Compliance With Operative Note Guidelines Through the Implementation of an Electronic Proforma. Cureus 2022;14:e32222. [Crossref] [PubMed]
Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al. Large language models in medicine. Nat Med 2023;29:1930-40. [Crossref] [PubMed]
Tang L, Sun Z, Idnay B, et al. Evaluating large language models on medical evidence summarization. NPJ Digit Med 2023;6:158. [Crossref] [PubMed]
Zhang X, Talukdar N, Vemulapalli S, et al. Comparison of Prompt Engineering and Fine-Tuning Strategies in Large Language Models in the Classification of Clinical Notes. AMIA Jt Summits Transl Sci Proc 2024;2024:478-87.
Hager P, Jungmann F, Holland R, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med 2024;30:2613-22. [Crossref] [PubMed]
Zaretsky J, Kim JM, Baskharoun S, et al. Generative Artificial Intelligence to Transform Inpatient Discharge Summaries to Patient-Friendly Language and Format. JAMA Netw Open 2024;7:e240357. [Crossref] [PubMed]
Sorin V, Brin D, Barash Y, et al. Large Language Models and Empathy: Systematic Review. J Med Internet Res 2024;26:e52597. [Crossref] [PubMed]
Bhayana R. Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications. Radiology 2024;310:e232756. [Crossref] [PubMed]
Omar M, Soffer S, Agbareia R, et al. Socio-demographic biases in medical decision-making by large language models: a large-scale multi-model analysis. medRxiv:2024.10.29.24316368 [Preprint]. 2024. Available online: https://www.medrxiv.org/content/10.1101/2024.10.29.24316368v1
Morani AC, Platt JF, Thomas AJ, et al. Hemostatic Agents and Tissue Sealants: Potential Mimics of Abdominal Abnormalities. AJR Am J Roentgenol 2018;211:760-6. [Crossref] [PubMed]
Apathy NC, Rotenstein L, Bates DW, et al. Documentation dynamics: Note composition, burden, and physician efficiency. Health Serv Res 2023;58:674-85. [Crossref] [PubMed]
Lee J, Sharma I, Arcaro N, et al. Automating surgical procedure extraction for society of surgeons adult cardiac surgery registry using pretrained language models. JAMIA Open 2024;7:ooae054. [Crossref] [PubMed]
Bombieri M, Rospocher M, Ponzetto SP, et al. Surgicberta: a pre-trained language model for procedural surgical language. Int J Data Sci Anal 2024;18:69-81.
Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 2024;30:1134-42. [Crossref] [PubMed]
Van Veen D, Van Uden C, Blankemeier L, et al. Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts. Res Sq [Preprint] 2023. doi: 10.21203/rs.3.rs-3483777/v1. Update in: Nat Med 2024;30:1134-42.
Goodman KE, Yi PH, Morgan DJ. AI-Generated Clinical Summaries Require More Than Accuracy. JAMA 2024;331:637-8. [Crossref] [PubMed]
Hartman V, Zhang X, Poddar R, et al. Developing and Evaluating Large Language Model-Generated Emergency Medicine Handoff Notes. JAMA Netw Open 2024;7:e2448723. [Crossref] [PubMed]
Huang L, Yu W, Ma W, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans Inf Syst 2025;43:1-55.
Wang L, Chen X, Deng X, et al. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit Med 2024;7:41. [Crossref] [PubMed]
Langenbach MC, Foldyna B, Hadzic I, et al. Automated anonymization of radiology reports: comparison of publicly available natural language processing and large language models. Eur Radiol 2025;35:2634-41. [Crossref] [PubMed]
Cite this article as: Mugu V, Sorin V, Chan A, Briggs C, Carr B, Olson M, Schupbach J, Zietlow J, Saddoughi S, Schmitz J, Khandelwal A. Feasibility of extracting key elements from thoracic surgical operative notes using foundational large language models. J Thorac Dis 2025;17(11):9470-9477. doi: 10.21037/jtd-2025-1318