Abstract
In this work, we present our strategy for developing domain-specific large language models which cover the vocabulary of the target domain and train on reliable sources of clinical information. Prostate cancer was chosen as a use-case for this study. We collected more than 1.8 million clinical notes and radiology and pathology reports for 15341 patients treated for prostate cancer in Mayo Clinic across three sites and outpatient clinics. In addition to domain-specific training data, we built domain-specific tokenizers and devised knowledge-guided training strategies for LLM development. During the self-supervised training, LLM was forced to predict domain-specific information by marking clinical terms using UMLS parser. We evaluated the model for downstream tasks of clinical information prediction and question answering using quantitative and user evaluation study to measure the accuracy, reliability and information completeness. We compared the domain-specific model against similarly sized general purpose model GPT-2 and a three-times larger domain specialized model. i.e., BioGPT. Our model outperformed GPT-2 on both tasks by a wide margin. Our model was also able to outperform BioGPT on clinical information prediction tasks and showed some advantages over BioGPT in question-answering tasks.
Introduction
Prostate cancer is the most commonly diagnosed non-skin cancer among men in the United States with a 5-year survival rate of about 100% for local or regional cancer1,2. However, the diagnosis of prostate cancer is often followed by immediate decline in mental and physical health3,4. Limited knowledge regarding the disease, embarrassment around physical examination such as rectal exam and discussion of sexual symptoms with healthcare professionals, potentially exacerbated depending on the provider gender5,6, are known causes of poor outcome among older patients and/or those belonging ethnic minorities 7–9. These uncertainties and anxiety related to sexual symptoms could lead patients to search for information through online search engines in which the patient may end-up with either incomplete, contradictory, misleading, and/or inaccurate information10. This, in turn, could cause further delay in treatment and poor outcomes.
During diagnosis, treatment planning, and post-treatment disease management, vast amounts of clinical information is generated and recorded by the care team about prostate cancer patients. This is usually in free-text form including clinical notes and radiology and pathology reports. Current developments in the field of large language modeling provide a unique opportunity to process this information for developing reliable agents for interactive knowledge sharing in a discreet way11–13. However, we hypothesize that general purpose LLMs, like GPT, are at a disadvantage when processing this information as they have not been trained for any specific domain. They may generate generic responses to domain-specific questions. They also suffer from hallucination, which can be especially dangerous in the sensitive domains like clinical medicine13–16. Research effort has been put into domain-specific LLM development such as MedPALM17 (540B), MedAlpaca18 (13B), and BioGPT19 (1.5B). However, these models are also broad and intended to cover complete domains of medicine and biology. To cover such vast fields, these models are growing in sizes reaching on the order of 100 billions of parameters. Medicine includes subspecialties, with different domain specific knowledge such as oncology, preventive medicine, and emergency medicine where the language and treatments differ.
Large text corpora are generally used to train large models. GPT20 and LLaMA21 have been trained on text crawled from the world wide web without verifying the source reliability. Training of specialized LLMs like BioGPT and MedPALM have been focused on domain-specific knowledge corpora like millions of scientific abstracts from PubMed. However, patient records are arguably the most critical source of information to capture the domain of any disease. The records include several distinct documents including oncology notes, nursing notes, ED reports, discharge summaries, and radiology and pathology reports. While GPT2 and LLaMA, and even medicine-focused BioGPT, have been trained on extremely large amounts of information, their training data cannot include such patient-specific documents because of patient privacy requirements set by Health Insurance Portability and Accountability Act (HIPAA). We argue that this puts them at an inherent disadvantage in terms of understanding the clinical decision making process and information dissemination to the patients themselves. While research papers and expert treatment guidelines can impart knowledge about latest treatment options and their general pros and cons, clinical notes actually describe how physicians and patients may decide on a treatment option and their following outcomes. We argue that inclusion of clinical notes and reports is critical when building a language understanding and generation tool for any particular disease.
We present our work on development of a domain-specific LLM to overcome the shortcomings of general purpose LLMs when applied to sensitive domains like medical decision making and information dissemination. We chose prostate cancer as our specific use case. We chose a relatively smaller size for our LLM (i..e, 124M parameters) and collected and parsed prostate cancer related patients’ data including clinical notes and radiology and pathology reports for its training. We also devised domain-specific self-supervised training techniques to ensure that model learned the nuances of the chosen domain. We evaluated the proposed LLM against similarly sized general purpose LLM (GPT2-small) as well as larger domain-specific LLM (BioGPT) on two tasks; i) clinical information mask prediction, and ii) question-answering. We also measured the performance in a user study with 3 independent raters who evaluated the generated responses in terms of completeness, correctness and relevance.
Methodology
In this section, we describe several steps involved in the development and training of a domain-specific LLM. We argue that these steps form a framework that could be replicated to other domains like breast cancer. Figure 1 shows the overall framework for domain-specific LLM development and application.
Data curation
With the approval of the internal review board (IRB) at Mayo Clinic, we collected clinical notes concerning 23,665 patients treated for prostate cancer at all three major sites of Mayo Clinic (Rochetser, Arizona, Florida). Statistics of these patients are provided in Table 1. This data included not only all clinical notes and radiology and pathology reports. We developed rule-based NLP techniques to gather information about cancer characteristics of these patients such as lesion size and Gleason score. These patients were also part of an enterprise-wide cancer registry which recorded their clinical outcome information.
Text Parsing
Clinical data collected for the patients cohort included ∼ 1.8 million clinical notes (including oncology and all other notes) and 23665 radiology and pathology notes. This data was used for training of our domain-specific LLM after several steps of parsing and cleaning.
Anonymization
Clinical notes contain not only important clinical information about patient-provider interaction, they also contain identifiable information about both patients and providers. While this data was not intended for public release, we wanted to ensure patient and provider privacy was not violated even by sharing the weights of our domain specific language model. Large language models display behaviors like memorization which leaves open the possibility of unmasking identifiable patient information used during the training phase. Therefore, we anonymized all clinical notes prior to their use in training of the proposed LLM using a pretrained BiLSM model with CRF output layer 22. Information like patient and provider names, dates, location were replaced with special tokens like “[NAME]”, “[DATE]”, and “[LOC]”. Only imaging finding sections from radiology and pathology information were used for LLM training. These sections are devoid of any patient information. Electronic signatures of providers were dropped using regular expression to ensure provider privacy.
Filtering
Textual dataset cleaning steps included filtering of very short (< 3 words) and very long sentences (>100 words). Given the repetition practice, we also used fuzzy matching techniques to filter out sentences which are too similar to other sentences within the clinical notes for one patient. This is done to improve the quality of training data. Repetition of standard/templated sentences may skew the behavior of the trained LLM towards generating such templated sentences at inference time.
Clinical Information Marking
We wanted to ensure that domain-specific LLM was focused on cancer treatment, side-effects, outcomes and symptoms related information during the training phase. To achieve this, we first needed to identify domain-specific information from free-text clinical notes. We used the Unified Medical Language System (UMLS) metathesaurus made available by the National Institute of Health23. UMLS framework is designed to standardize medical vocabularies and ontologies. Text parsing tools designed on top of UMLS identify medical concepts from free-text and map them to unique identifiers (CUI). Concepts are categorized into 127 “semtypes”. We selected 36 semtypes relevant to treatment, symptoms, side effects, clinical status and anatomical details. Filtered text from clinical notes was divided into sentences and passed through UMLS parser one by one and relevant medical entities and their location in the sentences were stored. We used a python parser for UMLS designed on a type of medical named entity extractor for parsing24. Our processing indicated that 84% of the sentences from clinical notes included important medical concepts. Medical concepts have varying distribution in the data with 33% concepts with one occurrence and 0.3% with greater than 10000 occurrences. A total of 193,936 concepts were identified. We argue that extremely common terms (>10000) may act as stop words such that they have no distinctive capabilities to establish differences between different text samples. For example, the concept “prostate cancer”,, “lesion”, “prostate specific antigen”, “elevated PSA” would have little distinctive capabilities within a corpus of all prostate cancer patients.
Domain-specific Tokenizer
Modern LLMs rely on word-piece tokenizers that meet the challenge of out-of-vocabulary token by chopping up the word into smaller foundational pieces such as “carcinoma” may be chopped into “car_” and “_cinoma”. However, we argue that chopping up important domain specific terms into multiple tokens may hinder model’s focus on learning about the whole clinical terms and their context. A passage full of relevant medical information containing N words may be divided in M tokens when M>>N. In extreme cases, M may exceed the length of the context window causing the model to ignore some information. On the other hand, a domain-specific tokenizer may be able to preserve boundaries of medical concepts and entities, thus shortening the length of the relevant text and allowing for better attention within the textual context. Therefore, we decided to train a domain-specific wordpiece tokenizer with vocab size of 50 thousand which was similar in size to popular tokenizers such GPT-2 tokenizer (vocab size of 50257). The tokenizer was trained on filtered clinical text as described in the previous section.
Domain-specific LLM
We based the architecture of our LLM on popular 12-layer masked self-attention decoder based architecture similar to the one used by GPT-220. The model was composed of about 124M trainable parameters. LLM training on free-text in self-supervised fashion is a critical component of LLM development. Simple language based self-supervision pre-training tasks like the prediction of the next token seem enough for general purpose models like GPT to gain fundamental understanding of natural language. We built upon this line of research by introducing domain-specific self-supervision in addition to generic self-supervision. Our LLM was trained in two phases.
Phase I - General purpose language understanding
Under this phase, our LLM was trained on free text data of prostate cancer patients including clinical notes and radiology and pathology reports for the task of next token prediction. Token sequences for free-text were generated using our domain-specific tokenizer. We named the model trained under this task only as PCa-LLM.
Phase II - Domain specific language understanding
In phase II of training, we aimed for the model to learn clinical language specific to prostate cancer. As described earlier, we marked clinical terms in free-text data of prostate cancer patients using UMLS parser. In this phase of training, we used masked clinical token prediction as a self-supervised training task.
Each text snippet may contain more than one clinical term. Terms were chosen with 50% probability for masking. Note that one term may be split into multiple tokens. In our training scheme, all tokens belonging to that term were masked and the model was trained to predict masked tokens.
For example, the sentence “common side effects of hormone therapy may include shrinkage of testicles” included clinical terms of “hormone therapy” and “shrinkage of testicles”. If “hormone therapy” was selected randomly for marking, all tokens belonging to it ({“hormone”, “therapy”}) would be masked. In this case, the model received “common side effects of <mask> <mask> may include shrinkage of testicles” and was tasked with predicting “hormone” and “therapy”.
We argue that this training scheme allowed the domain-specific LLM to focus on domain-specific language in addition to learning correlation between treatments, side effects, symptoms, and cancer characteristics. Such correlations can be essential for domain specific downstream tasks like patient education or treatment recommendation. For such downstream tasks, the model must be able to predict clinical terms using the context provided in the input text. This phase of training essentially taught the model to do the same. We call the model that underwent both phases of training PCa-MLM.
Evaluation Tasks
After training, we evaluated the model on two downstream tasks; i) masked clinical information prediction, and ii) question-answering.
Masked clinical information prediction
This task was similar to phase II of training. We randomly selected clinical terms and masked them in a set of held-out sentences (text not used during the training phase). The model was tasked with retrieving masked terms from the vocabulary of its tokenizer. We evaluated the model through an evaluation metric for information retrieval systems, i.e., recall@K where K was varied between 1 and 10, based on ranked retrieval results generated by the model probability.
Clinical question-answering
One of the most valuable downstream applications for any LLM can be a question-answering framework that can allow the users to interact with the LLM in an interactive and flexible manner. With this in mind, we chose question-answering as the second evaluation task.
We manually curated questions from treatment guidelines set by the American Cancer Society for prostate cancer. We browsed through each section of the guidelines and defined question-answer pairs of the following broad categories.
Treatment path recommended for cancer at different stages/risk category
Treatment path recommended for cancer at different stages/risk category under certain conditions (older patient, patients in poor health, patients with certain comorbidities)
Side effects of treatment
Long term risks of treatment
Medication used in treatment
Comparative pros and cons of different treatment
About 300 question answer pairs were manually curated. We then employed open-source general purpose LLM, i.e., LLaMA (13B)21 within a generative framework to paraphrase each question 4 different ways. After filtering for nonsensical paraphrasing such as empty strings, we were left with 1,222 QA pairs. Randomly selected unique 35 QA pairs were kept in a holdout set for evaluation purposes after dropping paraphrases of the same questions.
Results
In this section, we describe the results of evaluative experiments. We selected two models as comparative baselines for our model, i.e., GPT2-small and BioGPT. Selected version of GPT2 was similarly sized as our model but lacked the advantages of domain-specific vocabulary, training data, and training tasks. BioGPT was about three times bigger and was designed to cover a vast domain of medicine as a whole, and thus, had domain-specific vocabulary. Comparison with this model was meant to establish benefits of fine-grained domain definition (cancer vs. generic medicine).
Domain vocabulary Coverage
We applied general purpose GPT-2 tokenizer, medicine focused BioGPT tokenizer, and our tokenizer on clinical terms extracted by UMLS parser. GPT2 tokenizer tended to chop up medical concepts like “prostatectomy” (GPT2 tokenizer: 2 tokens, Our tokenizer: 1 token), “diabetes mellitus” (GPT2 tokenizer: 4 tokens, Our tokenizer: 2 token), etc. BioGPT tokenizer showed better clinical vocabulary coverage for up to 3 tokens than our tokenizer. However, it still chopped prostate cancer related words like “genitourinary” (GPT2: 5 tokens, BioGPT: 5 tokens, Ours: 1 token) and “erectile dysfunction” (GPT2: 5 tokens, BioGPT: 3 tokens, Ours: 2 token). Figure 2 shows distribution of medical concepts chopped into token sets of varying length by the three different tokenizers. Even though GPT2 coverage is minimal for the clinical terms with lowest number of tokens <=1, BioGPT and our model have similar coverage of the medical terms.
Masked clinical information prediction
For masked clinical information retrieval, each model produced a probability estimate for every tokens in its vocabulary for prediction as replacement for the masked token. Hence, a ranked list of all tokens in vocabulary (sorted in descending order of probability estimate) was available for each masked token. Each model was evaluated to see if the correct token was found in top K tokens for each masked token (recall@K). The performance was aggregated for all masked tokens (Table 2). We evaluated two phases of our model, i.e., Phase I: PCa-LLM and Phase II: PCa-MLM. Both versions significantly outperformed both comparative models with significant margin at each K, even when the vocabulary coverage for BioGPT is similar to the domain specific model.
Question-answering
For evaluation of the question-answering task, we relied on a user study where 3 independent users were asked to evaluate the responses of each model on the basis of three criteria.
Correctness
if the generated response was correct according to the clinical guidelines; the response did not have to be complete or cover all aspect of ground truth answer;
Completeness
if the generated response was complete, covered all aspects of the groundtruth answer;
Relevance
if the generated response was relevant to the question; response may be incorrect or incomplete based on the clinical guideline;
While correctness and completeness evaluate the accuracy of the response, the third criteria was introduced to assess the model’s comprehension of the question. Models, especially general purpose models - GPT2, may generate generic responses if they are unable to understand the granularity of the question.
After de-identification of the model names, all users were asked to provide 1/0 response for each question for all criteria; 1 indicated that the criteria was met (complete, correct or relevant) while 0 indicated that the generated response did not meet the criteria (incomplete, incorrect, or irrelevant). Responses in each category by every user were summed over 35 questions. By aggregate opinion of users (median scores), our model outperformed both comparative models. Table 3 shows the results from our user study where the median value for all the criteria are rated higher for our model. User 3 considers BioGPT responses rated higher for correctness and completeness while still rating our model higher for relevance based on the generated responses being more relevant for the prostate cancer domain. Higher numbers indicate better performance. Table 4 shows sample answers generated by all three models.
Discussion
Even though general purpose huge LLMs have demonstrated amazing capabilities to generalize to a wide variety of domains as zero-shot frameworks, their performance may still suffer in terms of relevance or specificity to a sensitive domain and may “hallucinate” while being forced to generate critical information14,16,25. Training of these models require humongous text corpora usually curated from publicly available text context available on the world wide web. Privacy concerns make it impossible to train such models in private data which may be critical for medicine related fields. We present a framework for design and development of domain-specific LLM with prostate cancer chosen as a use-case. Our framework includes domain-specific customization steps for tokenizer design, data curation, and self-supervised two-phases training. We argue that development of domain-specific LLMs provides a systematic solution to the challenges faced in terms of deployment of huge general purpose LLMs in sensitive domains. While some research has been done to specialize LLMs for sensitive domains like medicine and biology, this line of research usually focused on curation of large domain-specific datasets17,19. BioGPT and MedPALM are two popular biomedical LLMs developed to support this line of research. These LLMs still cover vast fields of medicine and biology and their training has been focused on public datasets. These LLMs have absorbed the knowledge of vast sources of medicine related information but have had little exposure to the real clinical data and thus patient-specific treatment decisions. In comparison, our framework focuses on training the model on patient EHR data collected by healthcare institutions including clinical notes and radiology and pathology reports. Thus, the models trained under our framework will be exposed to unique characteristics of patient-physician exchange for the particular targeted domain and able to learn the nuance of the treatment management.
We trained a small sized LLM (124M parameters) for the particular use case of prostate cancer which not only outperformed similarly sized general purpose LLM (GPT-2 small) but also outperformed the three times larger specialized LLM (BioGPT) for two downstream tasks. Specialized tokenizer allows the LLM to understand and focus on domain-specific terms instead of chopping them up into small generic tokens (see Figure 2). GPT2 tokenizer chops up clinical terms in a larger number of tokens while our specialized tokenizer and BioGPT tokenizer can comprehend most of the clinical terms as a whole. As these tokenizers were designed for the fields of medicine and biology, it tends to describe medicine related terms in fewer tokens. It covers more clinical terms within 3 tokens splitting than or tokenizer. However, clinical terms which are more specific to prostate cancer(i.e., found more commonly in prostate cancer related clinical text) are better handled by our tokenizer than BioGPT tokenizer. These terms include “erectile dysfunction”, “primary malignant neoplasm”, “prostatic hypertrophy”, “cystolitholapaxy”, “cystourethroscopy”, and others. In addition, several prostate cancer related drug names are part of our specialized vocabulary including “abiraterone” and “docetaxel”.
For the task of masked clinical term prediction, it is intuitive that PCa-MLM (phase I and II of training) outperformed PCa-LLM (phase I training only) as phase II training was performed using a very similar self-supervision task. However, even PCa-LLM outperformed GPT-2 and BioGPT by wide margins which has been trained using generic self-supervision tasks of next token prediction. This speaks to the benefits of specialized domain-specific tokenizers. BioGPT tokenizer displayed somewhat better coverage of generic clinical vocabulary than our tokenizer (as shown in Figure 2) but the model still performed extremely poorly for prediction of domain-specific (prostate cancer related) clinical information retrieval.
One of the main applications of our specialized LLM can be patient education as an interactive question answering framework. Therefore, we tested the performance of the model on a question-answering task. Note that these questions were curated from the website of the American Cancer Society. Since this data is publicly available, it is likely to be included in the training data of GPT2 and BioGPT’s training corpus. On the other hand, our model was only trained on clinical data generated within the hospital. It was never trained explicitly on the textual guidelines. Despite this, our model outperformed GPT2 by a wide margin on all three evaluation criteria. Table 4 shows some interesting examples of the response. It is clear that GPT2 often generated generic or irrelevant responses. Even when answering questions about side effects (Question 5), it described generic side effects like “feeling weak and tired” which was not very specific to the treatment in question. BioGPT does not suffer from this problem. Its answers were mostly related to the question even when it failed to produce the correct answer. For example, it generated names of bisphosphonate drugs when the question was about side effects of bisphosphonates (question 3). In answer to question 1, it generated a treatment option which is available for cancer (focal therapy) but was not part of the ground truth answer. PCaMLM and BioGPT responses differed from each other for question 2 and 4, but were still relevant and somewhat appropriate for the question.
Our experiments clearly establish the benefits of developing and employing domain-specific LLMs for sensitive domains like medicine and healthcare, especially for building interactive knowledge dissemination tools. We plan to expand upon this line of research by developing patient and physician facing chatbots on top of our domain specific LLM for the use case of prostate cancer. While it will be challenging to curate training datasets, i.e., question-answer pairs, manually, we plan to build upon automatic question-answer generation frameworks26. This line of research will provide a solution to reliability and specificity issues faced by general purpose chatbot like ChatGPT in the field of medicine.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
Footnotes
Some values were incorrectly presented in Table 1 in the previous version. The numbers have now been fixed. Overall size of the cohort remains the same.