Abstract
Generalized language models that pre-trained with a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English are published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate a development of a clinical specific BERT model with a huge size of Japanese clinical narrative and evaluated it on the NTCIR-13 MedWeb that has pseudo-Twitter messages about medical concerns with eight labels. Approximately 120 millions of clinical text stored at the University of Tokyo Hospital were used as dataset. The BERT-base was pre-trained with the entire dataset and a vocabulary including 25,000 tokens. The pre-training was almost saturated at about 4 epochs, and the accuracies of Masked LM and Next Sentence Prediction were 0.773 and 0.975, respectively. The developed BERT tends to show higher performances on the MedWeb task than the other nonspecific BERTs, however, no significant differences were found. The advantage of training on domain-specific texts may become apparent in the more complex tasks on actual clinical text, and such corpus for the evaluation is required to be developed.
Introduction
In recent years, generalized language models that perform pre-training on huge corpus have achieved great performance on a variety of natural language task. These language models are based on the transformer architecture that is a novel neural network based solely on a self-attention mechanism [1]. Some models such as Bidirectional Encoder Representations from Transformers (BERT) [2], Transformer-XL [3], XLNet [4], RoBERTa [5], XLM [6], GPT [7], GPT-2 [8] have been developed and updating state of the art results. It is considered that corpus for pre-training preferred to use the same domain as the target task. In the fields of life science or clinical medicine, domain specifically pre-trained model such as Sci-BERT [9], Bio-BERT [10], and Clinical-BERT [11], have been published for English texts. As for the Clinical-BERT, a study that domain specifically pre-training yields performance improvements on the tasks of common clinical Natural language processing (NLP) as compared to nonspecific model has been reported.
While many pre-trained transforms for English are published, few models are available for Japanese text, especially in clinical medicine. One of the options is to use multilingual BERT published by Google, however, the model would have disadvantage for word-based tasks because of its character-based vocabulary. As others for Japanese text, BERTs that have been pre-trained with Japanese WIKIPEDIA are published [12] [13], however, their applicability for the task of clinical medicine has not been studied. Because clinical narratives (physicians’ or nurses’ notes) should have differences in linguistic characteristics from text on the web, pre-training on clinical text would have an advantage for the clinical NLP tasks. In this work, we develop and publicly release a pre-trained BERT that make use of huge size of Japanese clinical narratives. We also demonstrate an evaluation with a shared NLP task to compare the developed clinical specific BERT with the three types of nonspecific BERTs for Japanese text.
Methods
Datasets
Approximately 120 million lines of clinical text for eight years stored at electronic health record system of the University of Tokyo Hospital were used. Data collection followed a protocol approved by the Institutional Review Board at the University of Tokyo Hospital (2019276NI). Those texts are mainly recorded by physicians and nurses for daily clinical practice. Because Japanese text includes two-byte full-width characters (mainly Kanji, Hiragana, or Katakana) and one-byte half-width characters (mainly ASCII characters), the Normalization Form Compatibility Composition (NFKC) followed by full-width characterization to all characters were applied as a pre-processing.
Tokenization of Japanese text
To input a sentence into BERT, it is necessary to segment the sentence into tokens included in the BERT vocabulary. In non-segmented languages such as Japanese or Chinese, a tokenizer must accurately identify every word in a sentence before attempt to parse it and to do that requires a method of finding word boundaries without the aid of word delimiters. To obtain BERT tokens from a Japanese text, morphological analysis followed by wordpiece tokenization is applied. Morphological analyzer such as the MeCab [14] or the Juman++ [15] is commonly used in Japanese text processing to segment a source text into word units that are pre-defined in its own dictionary. Subsequently, wordpiece tokenization is applied, which segment a word unit into several pieces of tokens included in BERT vocabulary. In the wordpiece tokenization, the word like playing is segmented to play and ##ing, the latter is called subword. Fig 1 shows a schematic view of morphological analysis and wordpiece tokenization of a Japanese text.
Making BERT vocabulary
A BERT model requires a fixed number of token vocabulary for word pieces embeddings. To make the BERT vocabulary, candidate word pieces were obtained by applying morphological analysis followed by the Byte Pair Encoding (BPE) [16] to the entire dataset. The MeCab was used as a morphological analyzer along with the mecab-ipadic-NEologd [17] and the J-MeDic [18] as external dictionary. The former has been built utilizing various resources on the web, and it was used to identify personal names from the clinical text as much as possible and aggregate them into a special token (@@N). The latter is domain specific dictionary that has been built from Japanese clinical text, and it was used to segment words for disease or findings in as large a unit as possible. BPE first decompose a word unit into character symbols, and subsequently create a new symbol by merging two adjacent and highly frequent symbols. The merging process is stopped if the number of different symbols reaches the desired vocabulary size. In addition to this process, candidate words that evoke specific persons or facilities were excluded by manual screening, which make the developed BERT publicly available. Eventually, 25,000 tokens including special tokens were adopted as the vocabulary.
Pre-training BERT
BERT has shown state of the art results for a wide range of tasks, such as single sentence classification, sentence pair classification, question answering without substantial task specific architecture modifications. The novelty of BERT is that it took an idea of learning word embeddings one step further, by learning each embedding vector directly from the sequence. In order to do this, BERT utilizes the self-attention mechanism, which learns sentence expressions by capturing co-occurrence relationships between words. Pre-training of BERT is performed by inputting fixed-length tokens obtained from two sentences and optimizing the Masked-LM and the Next Sentence Prediction simultaneously. Since these two tasks do not require manually supervised labels, the pre-learning is conducted as self-supervised learning.
Masked-LM
Fig 2A shows a schematic view of Masked-LM. This task masks, randomly replaces, or keeps each input token with a certain probability, and estimates the original embeddings of those tokens. By estimating not only the masked tokens but also the replaced or kept tokens, more appropriate representation of sentences is obtained. Although the selection probability of the tokens to be dealt with is arbitrary, we used the 15% mentioned in the original paper.
Next Sentence Prediction
Fig 2B shows a schematic view of Next Sentence Prediction. In this task, the model receives pairs of sentences and predicts whether the second sentence of the pair is the consecutive sentence in the original dataset. To develop such a training dataset, for two consecutive sentences in the original dataset, connect the original consecutive sentences with a probability of 50% as a positive example, and connect the remaining 50% of the sentences randomly sampled as a negative example. In general, two sentences connected by a period are assumed to be consecutive, and two sentences that pinch a line feed code are assumed to be non-consecutive. However, because for simply displaying clinical narratives in a narrow screen area, it seemed to be that clinical sentences tend to pinch a line feed code even if they were originally consecutive sentences. In other words, a series of sentences referring to the same topic appears as a list of non-consecutive short sentences, which is considered to have a negative impact on the training of Next Sentence Prediction. For this reason, we treated all sentences appearing in a document recorded in one day for a patient as consecutive sentences.
Evaluation task
The performance of the developed BERT was evaluated by fine-tuned approach using the NTCIR-13 Medical Natural Language Processing for Web Document (MedWeb) [19]. The MedWeb is publicly available and provides pseudo-Twitter messages about medical concerns in a cross-language and multi-label corpus, covering three languages (Japanese, English, and Chinese), and annotated with eight labels. Positive or Negative is given to eight labels of Influenza, Diarrhea, Hay fever, Cough, Headache, Fever, Runny nose, and Cold, and Positive may be given to multiple labels in a message. We performed a multi-label task to classify these eight classes simultaneously. Table 1 shows examples of each set of pseudo-tweets.
Experiment settings
For the pre-training experiments, we leverage the Tensorflow implementation of BERT-BASE (12 layers, 12 attention heads, 768 embedding dimension, 110 million parameters) published by Google [2]. Approximately 99% of 120 million sentences was used for training, and the remaining of 1% was used for the evaluation for the accuracies of Masked LM and Next Sentence Prediction.
For the evaluation experiments, the pre-trained BERT was fine-tuned. The network was configured such that the output vector C corresponding to the first input token ([CLS]) is linearly transformed to eight labels by a fully connected layer, and the positive or negative of each of the eight labels are outputted through sigmoid function. Binary cross entropy was used for the loss function, and the learning rate was optimized by Adam initialized with 1e-5. All network parameters including BERT are updated during this fine-tuning process. Fig 3 shows a schematic view of this network. The models are trained and tuned by five-folds cross-validation, which split 1,920 MedWeb training data into 4:1. The average results of each five tuned model for 640 MedWeb test data was assessed. The performance was assessed by the exact match accuracy and label-wise F-measure (macro F1).
To inspect an advantage of the domain specific model, we also evaluated the two kinds of domain nonspecific BERT that are pre-trained in Japanese Wikipedia and Google multilingual BERT. In addition, to inspect the difference of the performance due to input features, we also evaluated Support Vector Machine (SVM) and Logistic Regression (LR) that use the same morphological analyzer, dictionary and vocabulary to each BERT. Table 2 shows the specifications of each BERT model. In this table, the total number of unknown tokens appeared in the MedWeb dataset is shown, which are obtained by the tokenizer with each BERT vocabulary.
Results
Pre-training performance
Table 3 shows the results of the pre-training. The pre-training was almost saturated at about 10 million steps (4 epochs), and the accuracies of Masked LM and Next Sentence Prediction were 0.773 and 0.975, respectively. With a mini-batch size of 50, 2.5 million steps are equivalent to approximately 1 epoch. It took about 45 days to learn 4 epochs using single GPU. In the following experiment, UTH-BERT with 10 million steps of training was used.
Finetuning performance
Table 4 shows exact match performance of four kind of pre-trained BERT, SVM and LR. Although there was no significant difference due to the number of cross validations, the average performances have better tendency in the order of BERT models (0.851), LR models (0.724) and SVM models (0.701). Among the BERT models, UTH-BERT (0.862) performed the best and Google-ML BERT (0.827) performed the worst. There were slight differences among UTH-BERT (0.862), KU-BERT (0.855) and TU-BERT (0.858), which indicates no significant evidence for the advantage of domain-specific model. For SVM models, the performances have better tendency in the order of SVM1, SVM2, SVM3, which use the same tokenizer and vocabulary to UTH-BERT, KU-BERT and TU-BERT, respectively. This considered to indicate the suitability of each vocabulary for this evaluation task. It was also that LR models showed the same tendency as SVM models.
Table 5 shows the label-wise F-measure (macro F1) of each model. Like the exact match performance, the average F-measures have better tendency in the order of BERTs (0.883), LRs (0.839), and SVMs (0.827), and the average performances of the eight classes among BERTs were highest for UTH-BERT (0.893) and lowest for Google-ML BERT (0.867). As for the performances for each class, the performances of SVM2 (0.576), SVM3 (0.507), and LR2 (0.610), LR3 (0.539) for cough were significantly lower than the others (95% CIs were not shown).
Discussions
Interpretations of BERT model
In this study, we demonstrate a pre-training of BERT model with a huge size of Japanese clinical narrative and evaluated it on the NTCIR-13 MedWeb task. To the best of our knowledge, this work is the first to publish and inspect a BERT model that is pre-trained by Japanese text in clinical domain. Among the machine learning methods, BERT shows the most successful performance both in the exact match accuracy and the label-wise F-measure. A possible explanation for the success is that it is pre-trained on a large corpus, which affords good representation of the input sentence. As we shown in Table 5, the performances of SVM2, SVM3 and LR2, LR3 for cough were significantly lower than the others. This is because the vocabulary of KU-BERT for SVM2 and LR2, and TU-BERT for SVM3 and LR3 have few variations of cough symbols (zero in KU-BERT, two in TU-BERT and fifteen in UTH-BERT) and the relevant words tend to be treated as unknown words. Those unknown words are semantically ambiguous and had a negative impact on the classifier. Nevertheless, the performances of KU-BERT and TU-BERT for cough did not decrease. This was considered to be due to the Masked-LM that acquires sentence expressions from the context even if there are some unknown words. Similar phenomenon was also observed in SVM and LR models in Table 4. In this table, SVM 2, SVM 3, LR2 and LR3 have significantly lower performance than SVM 1 or LR1, which indicate less adaptable of the vocabulary of KU-BERT and TU-BERT for this evaluation task. However, KU-BERT and TU-BERT themselves show comparable performance to UTHE-BERT. This indicate that even if the vocabulary is less adaptable to the evaluation task, the negative impact on BERT performance is not so strong.
Advantage of pre-training on domain text
Among the BERT models, the performances of UTH-BERT, KU-BERT, and TU-BERT, which specialized to Japanese texts showed higher tendency than Google multilingual BERT. The multilingual BERT use character-based vocabulary that alleviate vocabulary problems for handling multiple languages instead of giving up the semantic information that words have. This result would indicate a disadvantage of character-based vocabulary compared to word-based vocabulary. Even so, the performances of the multilingual BERT were comparable to the other Japanese specific BERT models. With regard to the advantage of pre-training with domain-specific text, UTH-BERT pre-trained with clinical text tended to show higher performances than KU-BERT and TU-BERT pre-trained with Wikipedia, however, no significant differences were found. One of the reasons is that because sentence classification is a relatively easy task for BERTs which have pre-trained on large text, the advantage of pre-learning with the domain text may not have been noticeable. Another reason is that because the NTCIR-13 MedWeb used for the evaluation is intermediate corpus between WEB and medicine, the differences among BERTs may not have been clear. The advantage of training on domain-specific texts may become apparent in the more complex tasks such as question answering or causality recognition using actual clinical text, and such corpus for the evaluation is required to be developed.
Limitations and future works
Given that our developed BERT was evaluated exclusively on the NTCIR-13 MedWeb task, there is currently a limitation in the generalizability of the performance. Our purpose is to develop tools that will be useful for NLP in clinical domain, though, knowing their nature requires evaluations on the more complex clinical or non-clinical tasks. In parallel, error analysis is required to investigate the mistakes made by the BERT based model. For this purpose, visualizing self-attentions [20] in the BERT model will be useful. It is also that useful linguistic information can be found not only in the self-attentions themselves, but also in the attention maps that represent the relation of word to word attention between adjacent layers [21]. In future task, the detailed error analysis with those visualizing method of self-attentions mechanism is required, which may contribute in the understanding of how BERT models understand language.
Data Availability
1. Clinical text used for pre-training the UTH-BERT cannot be shared because of not publicly available, and restrictions apply to their use. 2. The UTH-BERT model is available for non-commercial use at the following URL: https://ai-health.m.u-tokyo.ac.jp/uth-bert 3. NTCIR-13 dataset is available upon request at the following URL: https://www.nii.ac.jp/dsc/idr/ntcir/ntcir.html
Supporting information
The pre-trained BERT model with instruction for running the code is available below. https://ai-health.m.u-tokyo.ac.jp/uth-bert
Author contributions
Conceptualization: YK, ES
Investigation: DS
Methodology: YK, DS
Resources: ES, KO
Supervision: EA, KO
Writing - original draft: YK
Writing - review & editing: YK, EA