Abstract
Background Extracting structured data from clinical notes is a key bottleneck in developing AI tools for radiology and pathology. Manual annotation is labor-intensive and unscalable. An efficient, automated method for clinical information extraction with human-level performance is urgently needed.
Purpose To introduce a low-code open-source library, Strata, that facilitates fine-tuning, evaluating, and deploying large language models (LLMs) for structured data extraction from clinical reports.
Materials and Methods Four clinical datasets were annotated by a trained human annotator: 431 prostate MRI reports, 978 breast pathology reports, 238 kidney pathology reports, and 724 myelodysplastic syndrome (MDS) pathology reports. Each dataset was split into training, development, and test sets, and a second reader annotated each test set. We fine-tuned open-source LLMs (Llama-3.1 8B, Mistral-v0.3 7B) to extract variables from clinical reports. We evaluated zero-shot and fine-tuned open-source models, zero-shot GPT-4, and the second human annotator using exact match accuracy, which counts a report as correct only if every variable for that report was extracted correctly.
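The exact match metric described above can be illustrated with a minimal sketch. This is not Strata's implementation; the function name `exact_match_accuracy` and the variable names `pirads` and `lesion_count` are hypothetical and chosen only for illustration. It assumes each report's extracted variables are stored as a dictionary.

```python
from typing import Any, Mapping, Sequence


def exact_match_accuracy(
    predictions: Sequence[Mapping[str, Any]],
    references: Sequence[Mapping[str, Any]],
) -> float:
    """Fraction of reports for which every extracted variable matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one report is required")
    # Mapping equality is true only if both have the same keys and the same values,
    # so a single wrong or missing variable makes the whole report count as incorrect.
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)


# Hypothetical variables for two reports; the second prediction has one wrong value.
references = [
    {"pirads": 4, "lesion_count": 1},
    {"pirads": 2, "lesion_count": 0},
]
predictions = [
    {"pirads": 4, "lesion_count": 1},
    {"pirads": 3, "lesion_count": 0},
]

accuracy = exact_match_accuracy(predictions, references)
print(accuracy)
```

Because the metric is all-or-nothing at the report level, it is stricter than per-variable accuracy: the second report above gets no credit even though one of its two variables is correct.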
Results The second human annotator obtained exact match accuracies of 83.1% (95% CI 75.4, 88.5), 75.5% (95% CI 70.4, 80.3), 70.8% (95% CI 59.7, 80.6), and 85.8% (95% CI 80.7, 89.9) in the prostate, breast, kidney, and MDS test sets, respectively. Our fine-tuned Llama-3.1 8B model achieved human-level performance across all fine-tuning settings, with exact match test-set accuracies of 91.5% (95% CI 85.4, 95.4), 91.5% (95% CI 88.1, 94.2), 72.2% (95% CI 61.1, 81.9), and 89.4% (95% CI 84.9, 93.1) for prostate, breast, kidney, and MDS reports, respectively. We found that ≤ 100 training reports were needed to achieve human-level performance on these tasks.
Conclusion Strata enables automated human-level performance in extracting structured data from clinical notes using ≤ 100 training reports.
Summary Statement Fine-tuned open-source LLMs achieved human-level performance in extracting structured outcomes from prostate MRI, breast pathology, kidney pathology, and myelodysplastic syndrome pathology reports, supporting diverse research applications.
Key Results
- Fine-tuned Llama-3.1 8B obtained human-level exact match accuracies in all four test sets.
- Fine-tuned Llama-3.1 8B needed 100, 95, 43, and 72 training reports to achieve human-level performance on the breast pathology, kidney pathology, prostate MRI, and myelodysplastic syndrome pathology datasets, respectively.
- Each fine-tuned model required 1 to 3 hours of training on a single A40 GPU, costing between $0.80 and $2.40 on a private cloud.
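The training cost range quoted above is consistent with a flat rate of about $0.80 per GPU-hour. That rate is an inference from the two endpoints ($0.80 for 1 hour, $2.40 for 3 hours), not a figure stated by the authors, and the helper name `training_cost` is hypothetical. A minimal sketch of the arithmetic:

```python
# Assumed hourly rate, inferred from the reported endpoints; not stated in the source.
ASSUMED_RATE_USD_PER_GPU_HOUR = 0.80


def training_cost(hours: float, rate_per_hour: float = ASSUMED_RATE_USD_PER_GPU_HOUR) -> float:
    """Cost in US dollars of a single-GPU training run, rounded to cents."""
    if hours < 0 or rate_per_hour < 0:
        raise ValueError("hours and rate_per_hour must be non-negative")
    return round(hours * rate_per_hour, 2)


low_cost = training_cost(1)   # shortest reported training time
high_cost = training_cost(3)  # longest reported training time
print(low_cost, high_cost)
```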
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
Adam Yala is supported by the E.P. Evans Foundation and Breast Cancer Research Foundation awards. Alexander G. Bick is supported by NIH grants DP5 OD029586, a Burroughs Wellcome Fund Career Award for Medical Scientists, the E.P. Evans Foundation, and a Pew-Stewart Scholar for Cancer Research award, supported by the Pew Charitable Trusts and the Alexander and Margaret Stewart Trust. Maggie Chung is supported by the Radiological Society of North America Research Scholar Grant. Funders did not directly influence or partake in this study.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The Human Research Protection Program of the University of California, San Francisco, waived ethical approval of this Health Insurance Portability and Accountability Act-compliant study because of its minimal risk and retrospective nature, with no subject contact.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
* Joint first authors
^ Joint senior authors
Funding information: AY is supported by the E.P. Evans Foundation and Breast Cancer Research Foundation awards. AGB is supported by NIH grants DP5 OD029586, a Burroughs Wellcome Fund Career Award for Medical Scientists, the E.P. Evans Foundation, and a Pew-Stewart Scholar for Cancer Research award, supported by the Pew Charitable Trusts and the Alexander and Margaret Stewart Trust. MC is supported by the Radiological Society of North America Research Scholar Grant. Funders did not directly influence or partake in this study.
Data sharing statement: We release Strata, our LLM-based library, as an open-source tool on GitHub. Strata supported all experiments reported in this study. The clinical reports used in this study contain protected health information and cannot be shared due to patient privacy concerns.
Data Availability
All data produced in the present work are contained in the manuscript.