Validation of a Zero-Shot Learning Natural Language Processing Tool for Data Abstraction from Unstructured Healthcare Data

Preprint, compiled August 2, 2023

Basil Kaufmann1,2, Dallin Busby1, Chandan Krushna Das1, Neeraja Tillu1, Mani Menon1, Ashutosh K. Tewari1, and Michael A. Gorin1

1Milton and Carroll Petrie Department of Urology, Icahn School of Medicine at Mount Sinai, New York, USA
2Department of Urology, University Hospital of Zurich, Switzerland

Abstract

Objectives: To describe the development and validation of a zero-shot learning natural language processing (NLP) tool for abstracting data from unstructured text contained within PDF documents, such as those frequently found within electronic health record systems. Materials and Methods: A data abstraction tool based on the GPT-3.5 model from OpenAI was developed and compared to three physician abstractors in terms of time to task completion and accuracy for abstracting data on 14 unique variables from a set of 199 de-identified radical prostatectomy pathology reports. The reports were processed by the software tool in vectorized and scanned formats to establish the impact of optical character recognition on data abstraction. The tool was assessed for superiority in data abstraction speed and non-inferiority in accuracy. Results: The human abstractors required a mean of 101 seconds (95% CI, 97 to 104 seconds) per report for data abstraction, with times varying from 15 to 284 seconds. In comparison, the software tool required a mean of 12.8 seconds (95% CI, 12.3 to 13.3 seconds) to process the vectorized reports and a mean of 15.8 seconds (95% CI, 15.1 to 16.5 seconds) to process the scanned reports (P < 0.001 for all paired comparisons). The overall accuracies of the three human abstractors were 94.7% (95% CI, 93.8 to 95.5%), 97.8% (95% CI, 97.2 to 98.3%), and 96.4% (95% CI, 95.6 to 97.0%) for the combined set of 2,786 datapoints. The software tool had an overall accuracy of 94.2% (95% CI, 93.3 to 94.9%) for the vectorized reports, proving to be non-inferior to the human abstractors at a margin of -10% (α = 0.025). The tool had a slightly lower accuracy of 88.7% (95% CI, 87.5 to 89.9%) using the scanned reports, proving to be non-inferior to two of the three human abstractors. Conclusion: The developed zero-shot learning NLP tool affords researchers levels of accuracy comparable to those of human abstractors, with significant time savings. Because it requires no task-specific model training, the developed tool is highly generalizable and can be used for a wide variety of data abstraction tasks, even outside the field of medicine.

Keywords data abstraction · natural language processing · large language models

1 Introduction

The advent of electronic health records (EHRs) has transformed clinical research by offering access to vast amounts of patient data. An essential part of EHRs are free-text fields. Unlike their structured counterparts, free-text fields offer a nuanced and comprehensive perspective on individual patient cases, capturing a depth of clinical information that is often not found in structured data. With that said, the use of unstructured data in clinical research is fraught with challenges [1–3]. Chief among them is the labor-intensive task of data extraction. Additionally, the lack of standardization in free-text entries, owing to their subjective and individualistic nature, complicates data aggregation and comparison across different records or healthcare providers. Furthermore, inconsistencies in the quality and completeness of data further heighten this challenge for researchers. These challenges have prompted the development of sophisticated natural language processing (NLP) techniques for data abstraction, which in turn carry the potential for misinterpretation or omission of vital information [4–7].

Recent advancements in artificial intelligence technologies, particularly the introduction of large language models (LLMs) with zero-shot learning capabilities, offer a promising solution to the abstraction of unstructured healthcare data [8–12]. Zero-shot learning is a concept in machine learning where a model is able to accurately classify data into categories that it has not previously encountered during its training phase. Thus, zero-shot learning makes it possible to forego the highly technical and time-consuming work of first training a model for a given data abstraction task.

Herein, we describe the development of an LLM-based tool that utilizes zero-shot learning to abstract unstructured healthcare data contained within Portable Document Format (PDF) files. Following successful development of the tool, we benchmarked its performance in terms of time to task completion and accuracy for abstracting data from a set of radical prostatectomy pathology reports, comparing it against three physician data abstractors.

2 Materials and Methods

Software Development. A zero-shot learning tool for data abstraction from PDF documents was developed using the Python programming language, version 3.10. The completed tool incorporates functions from several third-party libraries, including PyMuPDF for text extraction [13], TensorFlow Hub (Google DeepMind, London, UK) for loading a universal sentence encoder that creates semantic embeddings from the extracted data [14], Scikit-learn for performing nearest neighbor searches [15], and an application programming interface (API) from OpenAI (San Francisco, CA) for answer generation [16]. The software relies on the use of an API key provided by OpenAI that serves as an authentication token for accessing the “text-davinci-003” LLM. This model is based on GPT-3.5 and has been optimized for instruction-following tasks such as data abstraction [17].

Figure 1 provides a graphic representation of the steps followed by the software to perform the data abstraction task. First, the program extracts text from one or more PDF files specified by the user and removes any unwanted characters or white spaces. The extracted text is next divided into smaller chunks and processed by the sentence encoder model [14]. This step generates numerical embeddings of the text and creates an index of the data using a nearest neighbor algorithm. The software then performs a semantic search of the embedded data and finds relevant chunks of text based on the user’s input question. Next, a prompt is formulated for the text-davinci-003 model, incorporating the relevant text segments and the user’s question of interest. The API returns the model’s response, and this text is cleaned and placed in an Excel spreadsheet (Microsoft Corporation, Redmond, WA) for downstream data cleaning, standardization, and analysis.
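To make these steps concrete, the following is a minimal sketch of such a retrieval-and-prompt pipeline using the libraries cited above. It is illustrative only: the chunk size, number of retrieved neighbors, prompt layout, and completion parameters are assumptions, and the authoritative implementation is the code published on GitHub.

import fitz  # PyMuPDF
import openai
import tensorflow_hub as hub
from sklearn.neighbors import NearestNeighbors

# Assumes openai.api_key has already been set with the user's key.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def pdf_to_chunks(path, chunk_words=150):
    # Extract raw text from every page and split it into fixed-size
    # word chunks; splitting on whitespace also collapses stray spaces.
    doc = fitz.open(path)
    words = " ".join(page.get_text() for page in doc).split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]

def abstract_variable(path, question, k=5):
    chunks = pdf_to_chunks(path)
    embeddings = encoder(chunks).numpy()  # one vector per chunk
    index = NearestNeighbors(n_neighbors=min(k, len(chunks))).fit(embeddings)
    _, idx = index.kneighbors(encoder([question]).numpy())
    context = "\n".join(chunks[i] for i in idx[0])
    prompt = f"Search results:\n{context}\n\nQuery: {question}\nAnswer:"
    response = openai.Completion.create(
        model="text-davinci-003", prompt=prompt,
        max_tokens=256, temperature=0)
    return response["choices"][0]["text"].strip()

In the actual tool, the cleaned answers are additionally written to an Excel spreadsheet, one row per input PDF.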

Because the developed tool uses zero-shot learning for data abstraction, no training steps are required for its implementation. The user is only required to provide a prompt written in plain English to query against a provided set of PDF documents. For example, a researcher may have a series of articles from which they are interested in abstracting the name of each article’s first author. The user can prompt the tool to ‘Query the input files and return the name of the first author of each article.’ The result of this will be an Excel file with a column containing author names and an identifier linking the row to its respective input PDF file. The code underlying our tool is publicly available on GitHub (https://github.com/kaufmannb/PDF-Extractor). In addition, we have made available an executable file that researchers without computer coding experience can use immediately for data abstraction (https://rebrand.ly/5sxu4rn).

Datasets. The performance of the developed NLP tool was evaluated using a publicly available set of 199 de-identified radical prostatectomy pathology reports from The Cancer Genome Atlas (TCGA) program [18]. Although provided in PDF format, these pathology reports were previously printed and then scanned by the submitting sites. To mitigate potential errors introduced by optical character recognition (OCR), the reports were first converted back to a “vectorized” format as they originally appeared in the EHR. This was performed by exporting each document into .docx format using Adobe Acrobat Pro (Adobe, Inc., San Jose, CA) and then manually correcting any introduced errors in Microsoft Word. The corrected reports were saved back into PDF format, maintaining the integrity of the original formatting and layout (Supplementary Figure 1). These vectorized reports served as the input for our main round of data abstraction by the NLP tool. Additionally, a secondary analysis was performed using the original scanned reports to quantify the impact of OCR errors on the tool’s performance.

Figure 1: Workflow of the developed zero-shot learning NLP tool for data abstraction from unstructured text contained within PDF documents. Multiple PDF files (such as scanned pathology reports) are processed from a designated input folder and converted into text. Based on the user’s question, the tool applies an AI algorithm to isolate specific information. Finally, it outputs an Excel file containing the cleaned and extracted data into a predetermined output folder.

Assessment of the Data Abstraction Tool. Following completed development of the data abstraction tool and generation of the testing datasets, we benchmarked its performance against three human data abstractors (all of whom are physicians) in terms of time to task completion and accuracy of data abstraction. The human abstractors and software tool were asked to perform the identical task of abstracting 14 variables from the pathology reports (Table 1). These variables were selected based on their clinical relevance and alignment with the protocol for examination of radical prostatectomy specimens by the College of American Pathologists [19].

Responses from the three human abstractors were collected using a custom-built website that displayed each pathology report in a PDF viewer, accompanied by an input form designed to capture the abstracted data. Each variable featured a specific set of dropdown answer choices, ensuring that only predefined answers could be selected, thereby precluding the possibility of random or extraneous responses. Responses, along with processing times, were logged for each abstracted report.

Table 1: List of abstracted variables.

To extract data using the NLP tool, the following prompt was utilized: “Complete the following list of variables with the corresponding values extracted from the given pathology report: pT-Stage, Primary Gleason Grade, Secondary Gleason Grade, Gleason Sum Score, Tertiary Gleason Pattern or Grade, Extraprostatic Extension, Seminal Vesical Invasion, Lymphovascular Invasion, Perineural Invasion, Surgical Margin Status, pN-Stage, Number of Lymph Nodes Removed, Number of Lymph Nodes Involved by Cancer, Specific Prostate Weight in g.” Additionally, the program was instructed to “Compose a comprehensive reply to the query using the search results given. If the search results mention multiple subjects with the same name, create separate answers for each. Only include information found in the results and don’t add any additional information. Make sure the answer is correct and don’t output false content. If the text does not relate to the query, simply state ‘Found Nothing’. Ignore outlier search results which has nothing to do with the question. Only answer what is asked. The answer should be short and concise.”
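For readers who want to see how these two pieces fit together, below is a sketch of one plausible way the fixed instructions, the retrieved text chunks, and the variable-list query could be assembled into the final prompt. The string constants reproduce the study’s quoted prompt text (abridged where marked), but the exact assembly order used by the tool is an assumption.

# The two strings below reproduce the study's quoted prompt text
# (INSTRUCTIONS is abridged here; the full wording appears above).
VARIABLE_QUERY = (
    "Complete the following list of variables with the corresponding "
    "values extracted from the given pathology report: pT-Stage, "
    "Primary Gleason Grade, Secondary Gleason Grade, Gleason Sum Score, "
    "Tertiary Gleason Pattern or Grade, Extraprostatic Extension, "
    "Seminal Vesical Invasion, Lymphovascular Invasion, Perineural "
    "Invasion, Surgical Margin Status, pN-Stage, Number of Lymph Nodes "
    "Removed, Number of Lymph Nodes Involved by Cancer, Specific "
    "Prostate Weight in g."
)
INSTRUCTIONS = (
    "Compose a comprehensive reply to the query using the search "
    "results given. [...] The answer should be short and concise."
)

def build_prompt(context_chunks):
    # Join the retrieved chunks and place them between the fixed
    # instructions and the variable-list query (order assumed).
    context = "\n".join(context_chunks)
    return f"{INSTRUCTIONS}\n\nSearch results:\n{context}\n\nQuery: {VARIABLE_QUERY}"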

Figure 2: Violin plot showing the distribution of times required for data abstraction.

The mean time required for data abstraction was calculated for the three human abstractors as well as the NLP tool using the two sets of PDF reports. Similarly, the accuracy of data abstraction was calculated for these groups in terms of overall accuracy for the combined set of datapoints (14 variables × 199 reports = 2,786 datapoints) and at the individual variable level. For this analysis, the ground truth (i.e., the correct answer for each datapoint) was established from the individual responses provided by the human abstractors. The ground truth response was defined when there was agreement between at least two of the data abstractors. In cases with any degree of disagreement among the abstractors, the report was re-examined and, if a clear error was identified on the part of the dissenting abstractor, the majority consensus was upheld as the ground truth. However, if the re-examination did not reveal an obvious error, further discussions between the abstractors were held until consensus was reached. In instances where agreement between the three abstractors could not be achieved, a fourth reviewer was involved.
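This adjudication rule lends itself to a simple formalization. The helper below is hypothetical (the study performed this step manually); it merely encodes the two-of-three agreement rule, returning None when a datapoint must be escalated to re-examination and discussion.

from collections import Counter

def consensus(answers):
    # Majority vote across the three abstractors' answers for one
    # datapoint: return the value at least two agree on, or None to
    # flag the datapoint for re-examination and discussion.
    value, count = Counter(answers).most_common(1)[0]
    return value if count >= 2 else None

# consensus(["pT3a", "pT3a", "pT2"])  -> "pT3a"
# consensus(["pT3a", "pT2", "pT2c"])  -> None (escalate)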

Statistical Analysis. Differences in data abstraction times were assessed for statistical significance using the paired Student’s t-test. Comparisons were separately made between the NLP tool and each human abstractor for the two sets of PDF reports. Additionally, the performance of the NLP tool on the two datasets was compared in a paired fashion. For comparisons of superiority, a P-value of <0.05 was considered statistically significant. A similar paired analysis was performed to assess the accuracy of data abstraction using McNemar’s test of paired proportions. The accuracy of the NLP tool was assessed for non-inferiority to the human abstractors using a non-inferiority margin of -10% (α = 0.025). Data analysis and visualization were performed using R version 4.0.2 (R Foundation for Statistical Computing).
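As an illustration of these three comparisons (the study itself used R 4.0.2), the sketch below reproduces them in Python on synthetic placeholder data; the random values merely stand in for the real per-report times and per-datapoint correctness indicators.

import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_reports, n_vars = 199, 14

# Synthetic placeholders for the real measurements.
human_times = rng.normal(101, 40, n_reports).clip(15)
tool_times = rng.normal(12.8, 2, n_reports).clip(5)
tool_correct = rng.random(n_reports * n_vars) < 0.942
human_correct = rng.random(n_reports * n_vars) < 0.947

# 1) Superiority of abstraction speed: paired Student's t-test.
t_stat, p_time = ttest_rel(tool_times, human_times)

# 2) Accuracy difference: McNemar's test on the paired 2x2 table
#    of correct/incorrect indicators.
table = [
    [int(np.sum(tool_correct & human_correct)), int(np.sum(tool_correct & ~human_correct))],
    [int(np.sum(~tool_correct & human_correct)), int(np.sum(~tool_correct & ~human_correct))],
]
p_acc = mcnemar(table, exact=False).pvalue

# 3) Non-inferiority at a -10% margin: the lower bound of the one-sided
#    97.5% CI for the paired difference in proportions must exceed -0.10.
d = tool_correct.astype(float) - human_correct.astype(float)
diff, se = d.mean(), d.std(ddof=1) / np.sqrt(d.size)
lower = diff - 1.96 * se
print(f"speed p={p_time:.2g}, accuracy p={p_acc:.2g}, "
      f"diff={diff:.3f}, lower bound={lower:.3f}, non-inferior={lower > -0.10}")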

3 Results

As a group, the human abstractors required a mean of 101 seconds (95% CI, 97 to 104 seconds) per report for data abstraction, with times varying from 15 to 284 seconds. Assessed individually, the three human abstractors had data collection times of 104 seconds (95% CI, 99 to 109 seconds), 94 seconds (95% CI, 89 to 100 seconds), and 103 seconds (95% CI, 96 to 110 seconds), respectively. In comparison, the software tool required a mean of 12.8 seconds (95% CI, 12.3 to 13.3 seconds) to process the vectorized reports and a mean of 15.8 seconds (95% CI, 15.1 to 16.5 seconds) to process the scanned reports (P < 0.001 for all paired comparisons). The NLP tool required significantly less time to process the vectorized reports than the scanned documents (P < 0.001). Our observations with respect to data abstraction times are highlighted in Figure 2.

Compared to the ground truth, the overall accuracies of the three human abstractors were 94.7% (95% CI, 93.8 to 95.5%), 97.8% (95% CI, 97.2 to 98.3%), and 96.4% (95% CI, 95.6 to 97.0%) for the combined set of 2,786 datapoints. In contrast, the software tool achieved an overall accuracy of 94.2% (95% CI, 93.3 to 94.9%) for the vectorized reports and 88.7% (95% CI, 87.5 to 89.9%) for the scanned reports. The overall accuracy of the tool for abstracting data from the vectorized reports proved to be non-inferior to each of the three human abstractors (Supplementary Figure 2). For the scanned reports, the tool was non-inferior to two of the three human abstractors. A significantly higher degree of accuracy was achieved by the tool for the task of overall data abstraction when using the vectorized reports (Supplementary Figure 3).

Two of the three human abstractors achieved an accuracy of >90% for each of the 14 individual variables. The third abstractor achieved this level of accuracy for 12 of the 14 variables. In contrast, the software achieved >90% accuracy for 11 variables using the vectorized reports, whereas this level of accuracy was achieved for only 8 variables when using the scanned reports.

Figure 3 provides a summary of the accuracies achieved by the three human abstractors as well as the NLP tool using both datasets.

4 Discussion

The present study explored the time efficiency and accuracy of a zero-shot learning NLP tool for abstracting data from PDF pathology reports of radical prostatectomy specimens. Our data show that the tool is capable of achieving a 7-fold increase in processing speed compared to human abstractors while maintaining an overall abstraction accuracy of 94%. Notably, the tool was non-inferior to the three humans at a -10% margin for the overall task of abstracting 2,786 datapoints from vectorized PDF documents. Although the tool performed slightly worse using the dataset of scanned reports, non-inferiority was achieved relative to two of the three human abstractors. The performance of the software tool varied across the 14 variables examined, reaching an accuracy of over 90% in 11 of the 14 variables. However, the tool struggled with certain variables such as “Tertiary Gleason Pattern” or “Number of Lymph Nodes Removed,” achieving accuracies in the range of 83%.

Previous research has employed various machine learning algorithms to parse non-structured data from EHR systems [6, 9, 20–29]. However, these tools have uniformly required complex supervised training methods. This requirement is highly labor intensive and limits the number of variables one can practically extract. Furthermore, the reliance on supervised and semi-supervised learning methods limits the generalizability of such tools to unfamiliar sources of data. These limitations were evident in a study by McCowan et al. [23], who used rule-based and support vector machine classifiers to extract information on tumor (T) and nodal (N) stages from lung cancer pathology reports. This study required a training set of 710 reports and yet only achieved accuracy levels of 74% and 86% for T and N stage classification, respectively. Abedian et al. [20] looked to broaden the capability of NLP tools for abstracting cancer staging data from a wider range of cancer types, but this required an enormous training set of 555,681 pathology reports and was still limited to four outputs (ICD-10 code and T, N, and metastasis stage). While other more advanced tools, such as DeepPhe, have been specifically engineered to widen the breadth of data abstraction capabilities, such tools still require supervised training methods and/or have seen only limited validation in a handful of cancer types [26].

Compared to the tools mentioned above, our method utilizes a zero-shot learning strategy, which requires no task-specific training. This gives our tool the capacity to abstract data on any topic from a body of PDF documents. As urologists, the application described in this report was based on clinical data abstraction from radical prostatectomy reports, but the developed tool could be employed for any number of applications within the field of medicine. Additionally, our tool has the potential to be used across a broad range of industries that rely on insights from data contained in free-text format.

For the development of the software, we specifically utilized the publicly available text-davinci-003 model from OpenAI’s GPT-3 family. Compared to other iterations of GPT-3, the model used in our study was previously shown to excel in tasks like Named Entity Recognition (NER), identifying and categorizing key information from unstructured text [17]. This study also showed that the text-davinci-003 model was particularly well suited for interpreting complex sentences and phrases, as shown in tasks like Part-of-Speech Tagging [30] and Semantic Matching [31].

The prompt crafted by the user and sent to the text-davinci-003 model is crucial for achieving the desired output. GPT is highly sensitive to the form and choice of wording, and there are even specialists, known as “prompt engineers,” who excel in this task [32, 33]. An example of this can be found in a recent study showing that the specific manner in which multiple-choice questions are presented to an LLM can significantly impact the accuracy of its responses [34]. These authors found that simply asking multiple-choice questions of the model in the usual manner resulted in an accuracy of 55.8% when indeterminate responses were censored and 36.1% when indeterminate responses were included on the United States Medical Licensing Examination Step 1. When the users next asked the model to provide a rationale for each answer choice, the accuracy increased to 64.5% when indeterminate responses were censored and 41.2% when indeterminate responses were included.

While our model was highly accurate for the abstraction of 11 of the selected variables, more modest results were achieved for the remaining three. As described by Zheng et al. [35], the majority of incorrect responses from LLMs can be grouped into four categories of error: (1) comprehension, (2) factualness, (3) specificity, and (4) inference. In errors of comprehension, the model does not fully understand the context or the intent of the question or the information in the report, whereas in errors of factualness the model recalls incorrect or outdated information. Errors of specificity occur when the model does not provide enough information relevant to the question to derive a useful answer; for example, when asked to extract a Gleason score of 8, it answers with “high” instead of providing the specific number. Finally, for inference errors, the model might have all the necessary information to answer a question, but it fails to “reason” with the facts effectively to arrive at the correct answer.

Figure 3: Accuracy of data abstraction by the three humans and NLP tool.

The tool’s performance is not solely dependent on the GPT component or prompt generation. In the application selected for our tool, parsing PDFs is a critical aspect of the process. Various Python libraries are available for PDF parsing, but their efficacy can vary, potentially leading to data loss or errors during the conversion from PDF to text. In particular, data in complex layouts or tabular formats might be misinterpreted. In our sub-analysis of the scanned reports, we found that OCR errors reduced the overall accuracy of the NLP tool by 5.5%, to 88.7%.
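As a small illustration of where this fragility enters, the extraction step reduces to a call like the following minimal sketch using PyMuPDF, which the tool relies on; the page-joining strategy shown here is an assumption.

import fitz  # PyMuPDF

def extract_text(path):
    # Plain-text extraction, page by page. Multi-column layouts and
    # tables can lose their structure at this step, which is one
    # source of downstream abstraction errors.
    with fitz.open(path) as doc:
        return "\n".join(page.get_text("text") for page in doc)

# For scanned PDFs with no embedded text layer, get_text() returns an
# empty string, so a separate OCR pass must supply the text first.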

Despite the challenges mentioned, we feel the developed tool represents a significant advancement in the ability of researchers to abstract data without the need for individual NLP model creation or training. We feel that if vectorized PDFs or improved OCR and text extraction methods are utilized with carefully crafted prompts specific to individual variables, the error rate achieved would balance well against the gain of a 7-fold improvement in abstraction efficiency, freeing human abstractors to perform more critical research tasks. We should note, however, that the tool developed for this research report is not quite ready for immediate use in clinical practice, as it does not include end-to-end encryption, which is required to meet the standards of the United States Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR) of the European Union.

5 Conclusion

The presented zero-shot learning NLP tool affords researchers levels of accuracy comparable to those of human abstractors, with significant time savings. Because our tool does not require task-specific training, it can be adopted for a limitless number of data abstraction tasks, including applications outside the field of medicine. Continued refinement, particularly in terms of optical character recognition (OCR) capabilities, may enhance its practical use, speed, and degree of accuracy.

6 Acknowledgements

The research was supported by a grant from the Swiss Cancer League (Krebsliga Schweiz), the Swiss Urological Society, and the Menon Family Foundation.

References

[1] Maryam Y Garza, Tremaine Williams, Sahiti Myneni, Susan H Fenton, Songthip Ounpraseuth, Zhuopei Hu, Jeannette Lee, Jessica Snowden, Meredith N Zozus, and Anita C Walden. Measuring and controlling medical record abstraction (MRA) error rates in an observational study. BMC Medical Research Methodology, 22(1):227, 2022. ISSN 1471-2288.

[2] Hyoun-Joong Kong. Managing unstructured big data in healthcare system. Healthcare Informatics Research, 25(1):1–2, 2019.

[3] Brock Polnaszek, Andrea Gilmore-Bykovskyi, Melissa Hovanes, Rachel Roiland, Patrick Ferguson, Roger Brown, and Amy JH Kind. Overcoming the challenges of unstructured data in multi-site, electronic medical record-based abstraction. Medical care, 54(10):e65, 2016.

[4] Hadeel Alzoubi, Raid Alzubi, Naeem Ramzan, Daune West, Tawfik Al-Hadhrami, and Mamoun Alazab. A review of automatic phenotyping approaches using electronic health records. Electronics, 8(11):1235, 2019. ISSN 2079-9292.

[5] Benjamin T Miller, Aldo Fafaj, Luciano Tastaldi, Hemasat Alkhatib, Samuel Zolin, Raha AlMarzooqi, Chao Tu, Diya Alaedeen, Ajita S Prabhu, and David M Krpata. Capturing surgical data: Comparing a quality improvement registry to natural language processing and manual chart review. Journal of Gastrointestinal Surgery, 26(7):1490–1494, 2022. ISSN 1091-255X.

[6] Guergana K Savova, Ioana Danciu, Folami Alamudun, Timothy Miller, Chen Lin, Danielle S Bitterman, Georgia Tourassi, and Jeremy L Warner. Use of natural language processing to extract clinical cancer phenotypes from electronic medical records. Cancer Research, 79(21):5463–5470, 2019. ISSN 0008-5472.

[7] Seyedmostafa Sheikhalishahi, Riccardo Miotto, Joel T Dudley, Alberto Lavelli, Fabio Rinaldi, and Venet Osmani. Natural language processing of clinical notes on chronic diseases: systematic review. JMIR Medical Informatics, 7(2):e12239, 2019.

[8] Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S Rosen, Gerbrand Ceder, Kristin Persson, and Anubhav Jain. Structured information extraction from complex scientific text with fine-tuned large language models. arXiv preprint arXiv:2212.05238, 2022.

[9] Sami-Ramzi Leyh-Bannurah, Zhe Tian, Pierre I Karakiewicz, Ulrich Wolffgang, Guido Sauter, Margit Fisch, Dirk Pehrke, Hartwig Huland, Markus Graefen, and Lars Budäus. Deep learning for natural language processing in urology: State-of-the-art automated extraction of detailed pathologic prostate cancer data from narratively written electronic health records. JCO Clinical Cancer Informatics, 2:1–9, 2018. ISSN 2473-4276.

[10] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161. PMLR, 2015.

[11] Xiaohong Sun, Jinan Gu, and Hongying Sun. Research progress of zero-shot learning. Applied Intelligence, 51:3600–3614, 2021. ISSN 0924-669X.

[12] Wen-wai Yim, Meliha Yetisgen, William P Harris, and Sharon W Kwan. Natural language processing in oncology: a review. JAMA Oncology, 2(6):797–804, 2016. ISSN 2374-2437.

[13] PyMuPDF. Version 1.22.3. Retrieved from https://github.com/pymupdf/pymupdf.

[14] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, and Chris Tar. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.

[15] Scikit-learn. Version 1.2.2. Retrieved from https://scikit-learn.org/stable/.

[16] OpenAI Python library. Version 0.27.7. Retrieved from https://pypi.org/project/openai/.

[17] Junjie Ye, Xuanting Chen, Nuo Xu, Can Zu, Zekai Shao, Shichun Liu, Yuhan Cui, Zeyang Zhou, Chao Gong, and Yang Shen. A comprehensive capability analysis of GPT-3 and GPT-3.5 series models. arXiv preprint arXiv:2303.10420, 2023.

[18] National Cancer Institute. The Cancer Genome Atlas Program (TCGA).

[19] G Paner, J Srigley, Jason Pettus, Giovanna Angela Giannico, Joseph Sirintrapun, and Lara R Harik. Protocol for the examination of radical prostatectomy specimens from patients with carcinoma of the prostate gland. College of American Pathologists (CAP): Northfield, IL, USA, 2021.

[20] Sajjad Abedian, Evan T Sholle, Prakash M Adekkanattu, Marika M Cusick, Stephanie E Weiner, Jonathan E Shoag, Jim C Hu, and Thomas R Campion Jr. Automated extraction of tumor staging and diagnosis information from surgical pathology reports. JCO Clinical Cancer Informatics, 5:1054–1061, 2021. ISSN 2473-4276.

[21] Alexander P Glaser, Brian J Jordan, Jason Cohen, Anuj Desai, Philip Silberman, and Joshua J Meeks. Automated extraction of grade, stage, and quality information from transurethral resection of bladder tumor pathology reports using natural language processing. JCO Clinical Cancer Informatics, 2:1–8, 2018. ISSN 2473-4276.

[22] B. J. Kim, M. Merchant, C. Zheng, A. A. Thomas, R. Contreras, S. J. Jacobsen, and G. W. Chien. A natural language processing program effectively extracts key pathologic findings from radical prostatectomy reports. J Endourol, 28(12):1474–8, 2014. ISSN 0892-7790. doi: 10.1089/end.2014.0221.

[23] Iain A McCowan, Darren C Moore, Anthony N Nguyen, Rayleen V Bowman, Belinda E Clarke, Edwina E Duhig, and Mary-Jane Fry. Collection of cancer stage data by classifying free-text medical reports. Journal of the American Medical Informatics Association, 14(6):736–745, 2007. ISSN 1527-974X.

[24] Iain McCowan, Darren Moore, and Mary-Jane Fry. Classification of cancer stage from free-text histology reports. In 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, pages 5153–5156. IEEE, 2006. ISBN 1424400325.

[25] Anthony N Nguyen, Michael J Lawley, David P Hansen, Rayleen V Bowman, Belinda E Clarke, Edwina E Duhig, and Shoni Colquist. Symbolic rule-based classification of lung cancer stages from free-text pathology reports. Journal of the American Medical Informatics Association, 17(4):440–445, 2010. ISSN 1527-974X.

[26] G. K. Savova, E. Tseytlin, S. Finan, M. Castine, T. Miller, O. Medvedeva, D. Harris, H. Hochheiser, C. Lin, G. Chavan, and R. S. Jacobson. DeepPhe: A natural language processing system for extracting cancer phenotypes from clinical records. Cancer Res, 77(21):e115–e118, 2017. ISSN 0008-5472. doi: 10.1158/0008-5472.CAN-17-0615.

[27] Jeremy L Warner, Mia A Levy, and Michael N Neuss. ReCAP: feasibility and accuracy of extracting cancer stage information from narrative electronic health record data. Journal of Oncology Practice, 12(2):157–158, 2016. ISSN 1554-7477.

[28] Adam Yala, Regina Barzilay, Laura Salama, Molly Griffin, Grace Sollender, Aditya Bardia, Constance Lehman, Julliette M Buckley, Suzanne B Coopey, and Fernanda Polubriaginof. Using machine learning to parse breast pathology reports. Breast Cancer Research and Treatment, 161:203–211, 2017. ISSN 0167-6806.

[29] Sicheng Zhou, Nan Wang, Liwei Wang, Hongfang Liu, and Rui Zhang. CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. Journal of the American Medical Informatics Association, 29(7):1208–1216, 2022. ISSN 1527-974X.

[30] Helmut Schmid. Part-of-speech tagging with neural networks. arXiv preprint cmp-lg/9410018, 1994.

[31] Jyun-Yu Jiang, Mingyang Zhang, Cheng Li, Michael Bendersky, Nadav Golbandi, and Marc Najork. Semantic text matching for long-form documents. In The World Wide Web Conference, pages 795–806, 2019.

[32] Peter Lee, Sebastien Bubeck, and Joseph Petro. Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine. New England Journal of Medicine, 388(13):1233–1239, 2023. ISSN 0028-4793.

[33] Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382, 2023.

[34] Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, and James Maningo. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2):e0000198, 2023. ISSN 2767-3170.

[35] Shen Zheng, Jie Huang, and Kevin Chen-Chuan Chang. Why does ChatGPT fall short in answering questions faithfully? arXiv preprint arXiv:2304.10513, 2023.

7 Supplementary Material

Supplementary Figure 1: Example of a scanned (A) and vectorized (B) de-identified radical prostatectomy pathology report.

Supplementary Figure 2: Non-inferiority analysis of accuracy of the NLP tool versus human abstractors using vectorized (A) and scanned reports (B). Lines represent 97.5% confidence intervals of differences in proportions of accuracy, with points marking the mean difference. The blue dashed line denotes the -10% non-inferiority margin.

Supplementary Figure 3: Non-inferiority analysis of accuracy of the NLP tool using vectorized versus scanned reports. Lines represent 97.5% confidence intervals of differences in proportions of accuracy, with points marking the mean difference.