ABSTRACT
Electronic Health Records (EHRs) often lack reliable annotation of patient medical conditions. Yu et al. recently proposed PheNorm, an automated unsupervised algorithm to identify patient medical conditions from EHR data. PheVis extends PheNorm to the visit level. PheVis combines diagnosis codes with medical concepts extracted from medical notes, incorporating past history in a machine learning approach to provide an interpretable “white box” predictor of the occurrence probability for a given medical condition at each visit. PheVis is applied to two real-world use cases using the data warehouse of the University Hospital of Bordeaux: i) rheumatoid arthritis, a chronic condition; ii) tuberculosis, an acute condition (cross-validated AUROCs were 0.948 [0.945 ; 0.950] and 0.987 [0.983 ; 0.990], respectively). PheVis performs well for chronic conditions, although the inability of the natural language processing tools to exclude past medical history limits its performance in French for acute conditions.
INTRODUCTION
As the amount of data collected on a daily basis from hospital health care systems keeps increasing,[1] the appeal of leveraging the full potential of these data for research purposes and to investigate clinical questions is becoming stronger than ever.[2–5] Yet, EHR data are quite different from research-oriented data (e.g. cohort or trial data): i) they are less structured and more heterogeneous, ii) they present finer granularity, iii) data collection is done for health care purposes.[1,6–8] Currently, one of the main barriers to using such data for studying disease risk factors is the necessity to first identify patients having the diseases of interest, a task that we will denote as phenotyping.
Several approaches have recently been proposed to phenotype patients.[9] They often rely either on rule-based algorithms specifically designed with clinicians, or on supervised models trained on annotated patient datasets. Such algorithms are limited because their development is disease specific, must be (re-)started from scratch for every new disease, and demands a lot of clinician time. In addition, portability and generalization to new databases (e.g. different hospitals) can often fail, requiring the process to be repeated in the new institution. Hripcsak and Albers defined high-throughput phenotyping as an approach that “should generate thousands of phenotypes with minimal human intervention”.[8] In the same vein, Yu et al. developed an unsupervised framework to phenotype diseases using EHRs at the patient level.[10] It is fully automated and requires neither chart review nor rule definitions. They first apply Surrogate-Assisted Feature Extraction (SAFE),[11] an unsupervised feature selection method, and then PheNorm,[10] an unsupervised phenotyping method, to estimate the disease status at the patient level. Both PheNorm and SAFE primarily rely on the main International Classification of Diseases (ICD) codes and the main Unified Medical Language System (UMLS) Concept Unique Identifier (CUI) referring to the medical condition of interest (e.g. for rheumatoid arthritis: ICD codes M05 Rheumatoid arthritis with rheumatoid factor and M06 Other rheumatoid arthritis, and CUI C0003873 Rheumatoid arthritis). ICD codes come from the medical billing process of the hospital, whereas UMLS concepts are extracted from patients’ clinical narrative notes with a natural language processing (NLP) tool. The SAFE algorithm selects candidate medical features from a pool of publicly available sources (e.g. Wikipedia, MedLine, etc.). The main ICD code and main CUI counts are used to create silver-standard labels for the medical condition of interest. Relevant features are then selected using an elastic-net regression on the silver labels. The PheNorm algorithm also uses the main counts as a surrogate of the condition of interest and combines features into a phenotyping score through a noising-denoising step. The AUROCs of PheNorm were similar to those of supervised algorithms for rheumatoid arthritis.[10]
Although this framework is appealing, we need to go further than phenotyping at the patient level, especially for studying acute diseases (which can occur repeatedly) or for answering epidemiological questions (where the temporal sequence is important). Phenotyping at the visit level makes it possible to account for the dynamic evolution of patients’ conditions. In addition, the SAFE-PheNorm framework was developed using English databases, leveraging advanced NLP tools and rich terminologies available in English.[12,13] Portability to other languages is not straightforward, as they still often lack resources of matching quality.
We propose PheVis, a new unsupervised algorithm extending PheNorm to the visit level. It is evaluated for rheumatoid arthritis (RA) and tuberculosis (TB), a chronic and an acute condition respectively, using French EHRs from the University Hospital of Bordeaux.
METHODS
PheVis combines ICD billing codes with medical concepts extracted from medical notes, incorporating past information through a user-tunable exponential decay. This creates a silver-standard surrogate of the medical condition of interest. Then variable selection (through elastic-net logistic regression) and pseudo-labelling (using a random forest) are performed, leveraging extreme values of this silver standard. Finally, a logistic regression model is estimated on those noisy labels to provide an interpretable “white box” predictor of the occurrence probability for a given medical condition at each visit. The different steps of PheVis are outlined in Figure 1. We briefly explain the four main steps for training the PheVis algorithm (a detailed description of the method is available in the supplementary material).
Medical concepts of interest for a given disease are extracted from clinical notes and from ICD10 billing codes into a matrix where each row is a visit.[14] The first step is to sum the main ICD codes (e.g. for RA: M05 Rheumatoid arthritis with rheumatoid factor and M06 Other rheumatoid arthritis) and the main CUI concepts (e.g. for RA: C0003873 Rheumatoid arthritis). As CUI occurrences largely outnumber ICD ones, both are first standardized (otherwise the CUI information would largely dominate the learning). Second, the information is cumulated over time according to an exponential decay law whose half-life can be tuned by the user through the λ parameter mentioned in Figure 1 (for a chronic disease, all past history is cumulated with an infinite half-life, while for an acute disease the half-life is set to a short value, as the disease will not last for many visits). For instance, a patient having RA might come for an acute infectious disease where clinicians will not focus on RA; past information is needed to predict the current medical condition. Third, the previously defined cumulative sum mCumulij for patient i at visit j is transformed into Sij, a categorical variable that has three possible values: 0 (extremely low cumulative sum), 0.5 (medium cumulative sum), or 1 (extremely high cumulative sum), defining silver-standard labels. To set the thresholds separating these three categories, we use the prevalence of the main ICD code, denoted pICD. Sij is set to 1 if mCumulij is among the top pICD/2 fraction of visits, and to 0 if mCumulij is among the lowest pICD/2 fraction; other visits are set to 0.5. This takes into account the variability of disease prevalence in the training cohort. Fourth, the disease probability is estimated by i) first attributing a pseudo-value to each visit regarding the phenotype (either 0 or 1) using a random forest trained on extreme visits with Sij ∈ {0, 1} (this optional step improves the performance by smoothing the predictor), and ii) then fitting a logistic regression to estimate the probability of the medical condition of interest at each visit considered. This finally yields a phenotype prediction at the visit level.
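The third step (silver-standard labelling) can be illustrated with a minimal Python sketch. It is not the PheVis R implementation: the function name silver_labels is hypothetical, the number of visits in each extreme group is taken as half the number of visits carrying a main ICD code (which equals a pICD/2 fraction of visits, rounded down), and ties are broken by visit order, which is an arbitrary choice.

```python
def silver_labels(m_cumul, main_icd):
    """Return a silver-standard label (0.0, 0.5 or 1.0) for each visit.

    m_cumul  : cumulated main-concept score of each visit
    main_icd : number of main ICD codes recorded at each visit
    """
    n = len(m_cumul)
    # Visits with at least one main ICD code; dividing by n would give pICD.
    n_icd_visits = sum(1 for count in main_icd if count > 0)
    # Size of each extreme group: a pICD/2 fraction of the n visits.
    k = n_icd_visits // 2
    # Visit indices sorted by increasing cumulated score (ties keep visit order).
    order = sorted(range(n), key=lambda v: m_cumul[v])
    labels = [0.5] * n
    for v in order[:k]:
        labels[v] = 0.0          # extremely low cumulated score
    if k > 0:
        for v in order[-k:]:
            labels[v] = 1.0      # extremely high cumulated score
    return labels
```

With this convention, only the visits labelled 0 or 1 would be passed to the random forest, while visits labelled 0.5 are left undecided.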
RESULTS
Application design
We illustrate the PheVis method on RA, a chronic disease which cannot be cured, and active TB, an acute disease which usually lasts between 6 and 12 months.[15–18] The model performance was evaluated on an imperfect gold standard for both diseases: for RA we used the presence of at least one rheumatoid arthritis form specifically used by rheumatologists at the University Hospital of Bordeaux in usual care, and for tuberculosis we manually reviewed patients with at least one mention of tuberculosis treatment, while other patients were considered not to have the disease. Latent tuberculosis was labelled as tuberculosis negative because, even though the bacterium is the same, symptoms, diagnosis and treatment are different. Patients were included if they had been hospitalized at the University Hospital of Bordeaux at least once since 2010 and if they had either one primary or secondary ICD code of RA (M05 or M06), or one biology measurement of Anti-Citrullinated Peptide Antibody. The cohort was split into training and test datasets at the patient level with a 70% to 30% ratio. The cohort is described in Table 1, which highlights the discrepancy between ICD, CUI and gold standard that justifies the need for phenotyping algorithms.
Four different prediction models were evaluated for each disease: i) using mCumulij as a direct predictor of the phenotype, ii) our proposed PheVis approach, iii) a supervised elastic-net model trained using the gold standard directly, iv) a supervised random forest trained using the gold standard directly. For PheVis we chose an infinite half-life for RA (λ = 0) and a half-life of 180 days for TB (λ = ln(2)/180 per day), tuberculosis typically lasting around 6 months.
Application results
Figure 2 shows individual PheVis predictions for four patients chosen for their various profiles. For each disease, the model was able to correctly capture the onset of the disease. For TB, however, it failed to return the prediction to a near-zero probability when the disease ended. Two main reasons can explain this phenomenon: i) the value of the decay constant λ was poorly chosen; and ii) because of limitations in the NLP pre-processing step (no section segmentation to exclude past medical history), PheVis is not able to distinguish current information from past history. This has less impact for phenotyping RA because both current and past information positively predict the phenotype (which is generally the case for chronic conditions).
Figure 3 shows the global performance of PheVis on the test set compared to supervised models and mCumulij for both diseases. The performance for RA was satisfactory, while for TB only the Receiver Operating Characteristic (ROC) curve was satisfactory (mostly an artifact due to the low prevalence of TB), and not the precision-recall curve. For TB, the low performance is partly explained by the difficulty of distinguishing active TB from latent TB and from other mycobacterial infections, as they share similar treatments and vocabulary.
Table 2 shows point performance for two arbitrary phenotyping decision rules: i) a predicted probability above 0.5, and ii) a predicted probability above the threshold maximizing the sum of the precision and the recall. Specificity and negative predictive values are good partly because the diseases are rare at the visit level. Matching the results from Figure 2, the trade-off between sensitivity and positive predictive value is better for RA than for TB.
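As an illustration of the second decision rule, the following Python sketch searches the observed predicted probabilities for the threshold that maximizes precision plus recall. The function names are hypothetical, the candidate thresholds are restricted to the observed probabilities, and precision is set to 0 when no visit is predicted positive; these are conventions of this sketch, not necessarily those used for Table 2.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall of the rule 'predicted probability > threshold'."""
    predicted = [p > threshold for p in y_prob]
    tp = sum(1 for t, q in zip(y_true, predicted) if t == 1 and q)
    fp = sum(1 for t, q in zip(y_true, predicted) if t == 0 and q)
    fn = sum(1 for t, q in zip(y_true, predicted) if t == 1 and not q)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(y_true, y_prob):
    """Observed probability maximizing precision + recall (lowest one on ties)."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: sum(precision_recall(y_true, y_prob, t)))
```

For example, precision_recall(labels, probabilities, 0.5) gives the point performance of the first decision rule, and best_threshold(labels, probabilities) gives the cut-off used by the second.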
DISCUSSION
We developed PheVis as an unsupervised automatic phenotyping algorithm at the visit level. It achieves good performance for RA, which is promising for other chronic conditions, but suffers from limitations when it comes to acute conditions such as TB. PheVis is nearly fully automated, not requiring any (time-consuming and expensive) chart review, and its framework can in theory be used for different kinds of medical conditions (either acute or chronic). It resembles the probabilistic approach to diagnosis used by clinicians, as the output is a probability that takes into account the uncertainty of the information inside the EHR.[19]
PheVis adds several innovations to the PheNorm algorithm it builds upon: the standardization of the information from medical notes and ICD codes, the accumulation of past history with exponential decay, the definition of a silver standard using ICD codes to take into account the prevalence of the disease, and pseudo-labelling to improve performance and increase the stability of predicted probabilities. We also demonstrated the portability (and limitations) of those methods in French and in a different data warehouse than the one used to develop PheNorm, with consistent performance for phenotyping RA.
These algorithms are highly sensitive to the input features, which emphasizes the need for finer natural language processing tools able to perform semantic analysis. The use of other features such as biological test results or treatments should also be considered, as they should be highly predictive of the phenotype, but further work is needed to define how they could be integrated into the silver-standard surrogate.
The model was evaluated against an imperfect gold standard, mainly because large annotated patient reference sets are lacking. For TB, the gold standard was manually curated, while for RA we used a highly specific form that might lack sensitivity. Interestingly, upon manual inspection, PheVis appeared to accurately recover the RA visits of 5 patients who were not treated in the Rheumatology department of the University Hospital of Bordeaux and thus had no record of this specific form, i.e. cases where the gold standard failed. Such phenomena might lead to an underestimation of the performance of the algorithm.
CONCLUSION
PheVis might be able to provide a probability for a large set of diseases and medical conditions with little effort. The performance might vary depending on the disease of interest, the database and the language. The use of those estimated probabilities opens new horizons for the use of EHRs for medical and epidemiological research purposes. PheVis is implemented in an R package available on Git (https://plmlab.math.cnrs.fr/fthomas/phevis2) and will soon be submitted to CRAN.
Data Availability
Data were provided by the CHU de Bordeaux.
SUPPLEMENTARY MATERIAL
ICD codes
Detailed methods
1. Input data
The input data of the PheVis workflow are the clinical notes and the ICD codes from a data warehouse. All the notes and ICD codes are collapsed by visit, and dictionary-based named entity recognition is used to extract CUIs.[14] ICD codes are aggregated at the one-letter, two-digit level (e.g. M05.1 -> M05). This results in an X matrix of φ × P dimension, where φ is the total number of visits and P the total number of ICD and UMLS concepts. We will denote i ∈ {1, …, n} the patient index and j ∈ {1, …, vi} the visit index.
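The construction of the visit-by-concept count matrix can be sketched as follows in Python. This is only an illustration under assumptions: the record layout (patient, visit, kind, code) and the function names are hypothetical, CUI extraction from the notes is supposed to have been done upstream, and each record counts for one occurrence.

```python
from collections import Counter, defaultdict


def truncate_icd(code):
    """Aggregate an ICD-10 code to one letter and two digits: 'M05.1' -> 'M05'."""
    return code.strip().upper()[:3]


def build_visit_matrix(records):
    """Count concepts per visit.

    records: iterable of (patient_id, visit_id, kind, code) tuples,
             where kind is "ICD" or "CUI" (hypothetical layout).
    Returns the sorted visit keys, the sorted concept names and the
    count matrix with one row per visit and one column per concept.
    """
    counts = defaultdict(Counter)
    for patient_id, visit_id, kind, code in records:
        concept = truncate_icd(code) if kind == "ICD" else code
        counts[(patient_id, visit_id)][concept] += 1
    visits = sorted(counts)
    concepts = sorted({c for row in counts.values() for c in row})
    matrix = [[counts[v][c] for c in concepts] for v in visits]
    return visits, concepts, matrix
```

Carrying the kind of each record explicitly avoids confusing CUIs such as C0003873 with ICD-10 codes that also start with the letter C.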
2. Build surrogate
As we do not have disease labels for the visits, we cannot use supervised modeling straight away. To be able to train our phenotyping algorithm, we first build a surrogate variable expected to be close to the true disease status. This surrogate is based on the main ICD and UMLS disease codes.
We define mCij the standardized sum of main disease concepts as:
mainICD and mainCUI are main concepts related to the disease. For example for RA we used:
- mainICDij=M05ij+M06ij with M05ij the number of times the code M05 Rheumatoid arthritis with rheumatoid factor was recorded for patient i at visit j, and similarly for M06ij and M06 Other rheumatoid arthritis
- mainCUIij = C0003873ij with C0003873 Rheumatoid arthritis
The standardization is necessary because CUI occurrences largely outnumber ICD code occurrences. Without it, the weight of ICD codes in the prediction would be negligible.
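The defining equation of mCij is not reproduced in this section. One form consistent with the description is given here as an assumption only: each main count is divided by its empirical standard deviation over all visits before summing. Centering the counts, or another scaling, would fit the description equally well, and the symbols for the standard deviations are introduced for this illustration.

```latex
\[
mC_{ij} \;=\; \frac{\mathrm{mainICD}_{ij}}{\widehat{\sigma}_{\mathrm{mainICD}}}
        \;+\; \frac{\mathrm{mainCUI}_{ij}}{\widehat{\sigma}_{\mathrm{mainCUI}}}
\]
```

Under this form, one additional ICD code and one additional CUI mention each shift mCij by an amount expressed in standard deviations of their own distribution, which is what prevents the more frequent CUI counts from dominating.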
To phenotype a visit, it is necessary to take into account information from previous visits. For example, a patient can be diagnosed with RA at the age of 50 and have a visit at 52 for an infectious event containing no information about RA. We want to be able to predict RA at both visits. To do so, we propose to cumulate past history information with an exponential decay as follows:
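The equation is missing at this point. A recursion consistent with the description is sketched here, assuming that the decay applies to the time elapsed between consecutive visits; the visit date t_ij (in days) is a symbol introduced for this illustration.

```latex
\[
mCumul_{i1} = mC_{i1},
\qquad
mCumul_{ij} = mC_{ij} + mCumul_{i,j-1}\, e^{-\lambda\,(t_{ij} - t_{i,j-1})}
\quad \text{for } j \ge 2
\]
```

With λ = 0 the recursion reduces to a plain cumulative sum over all past visits, which matches the infinite half-life used for chronic conditions.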
λ is a constant parameter set by the user that controls the “loss of memory” of the algorithm. For easier interpretation, one may prefer to set the half-life, which equals ln(2)/λ. The half-life chosen was the usual duration of the disease (180 days for TB and infinity for RA).
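The relation between half-life and λ, and its effect on the cumulated score, can be illustrated with a short Python sketch. It assumes that the contribution of past visits is multiplied by exp(-λ Δt), where Δt is the number of days since the previous visit; the function names are hypothetical and the half-life is expected to be strictly positive.

```python
import math


def decay_rate(half_life_days):
    """Convert a half-life in days into lambda = ln(2) / half-life."""
    if math.isinf(half_life_days):
        return 0.0  # infinite half-life: nothing is ever forgotten
    return math.log(2) / half_life_days


def cumulate(values, days, half_life_days):
    """Cumulate the per-visit values of one patient with exponential decay.

    values : per-visit value (e.g. the standardized main-concept sum)
    days   : visit dates in days, in increasing order
    """
    lam = decay_rate(half_life_days)
    cumulated = []
    for j, value in enumerate(values):
        if j == 0:
            cumulated.append(value)
        else:
            elapsed = days[j] - days[j - 1]
            cumulated.append(value + cumulated[-1] * math.exp(-lam * elapsed))
    return cumulated
```

For instance, with a half-life of 180 days, a single mention at the first visit contributes half of its value at a visit 180 days later and a quarter 360 days later, whereas with an infinite half-life it keeps its full value.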
The same exponential decay accumulation is applied to each ICD and UMLS code. We also define five other variables:
This yields an augmented matrix Xa of φ × (2P + 5) dimensions: CUIs and ICDs and their cumulated counts, and five new variables.
3. Variable selection
We used the SAFE algorithm to select predictive variables of interest and reduce the dimensionality of the optimization problem. First, we used NLP to extract ICD and UMLS concepts from external resources.[15–18] A concept and its cumulative count were kept if the concept was found in the two resources. Then we categorized mCumulij into Sij ∈ {0, 0.5, 1}. To define the thresholds, we used mainICDij, which takes into account prevalence variability depending on the disease and the cohort. We define quantextreme as the quantile of visits having at least one main ICD code as:
It allows us to define Sij as:
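The two equations are missing here. A reconstruction consistent with the main text (extreme groups made of the top and bottom pICD/2 fractions of visits) is sketched below; the exact handling of ties and of the inequality signs is an assumption, and q_α denotes the empirical α-quantile of mCumul over all visits, a notation introduced for this illustration.

```latex
\[
p_{ICD} = \frac{1}{\varphi} \sum_{i=1}^{n} \sum_{j=1}^{v_i}
          \mathbb{1}\{\mathrm{mainICD}_{ij} > 0\},
\qquad
quant_{extreme} = \frac{p_{ICD}}{2}
\]
\[
S_{ij} =
\begin{cases}
1   & \text{if } mCumul_{ij} \ge q_{\,1 - quant_{extreme}} \\
0   & \text{if } mCumul_{ij} \le q_{\,quant_{extreme}} \\
0.5 & \text{otherwise}
\end{cases}
\]
```

Tying the thresholds to the proportion of visits with a main ICD code makes the size of the extreme groups adapt to the prevalence of the condition in the training cohort.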
Then we trained a logistic regression with elastic-net penalization to select a subset X′ of relevant variables from the Xa matrix:
Of note, mainICD and mainCUI are always forced into the set of selected variables, while Cumij is systematically removed for acute conditions.
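The selection model itself is not reproduced above. For reference, a generic elastic-net penalized logistic regression, fitted here on the extreme visits only (an assumption based on the main text), can be written as follows; the penalty strength α, the mixing parameter ρ and the set E of extreme visits are symbols introduced for this illustration.

```latex
\[
(\widehat{\beta}_0, \widehat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
 -\frac{1}{|E|} \sum_{(i,j) \in E}
   \Big[ S_{ij}\,\eta_{ij} - \log\big(1 + e^{\eta_{ij}}\big) \Big]
 + \alpha \Big( \rho\,\lVert \beta \rVert_1
              + \tfrac{1-\rho}{2}\,\lVert \beta \rVert_2^2 \Big),
\]
\[
\eta_{ij} = \beta_0 + X^{a}_{ij}\,\beta,
\qquad
E = \{(i,j) : S_{ij} \in \{0, 1\}\}
\]
```

In such a formulation, X′ would consist of the columns of Xa with a non-zero estimated coefficient, to which mainICD and mainCUI are added regardless of their coefficient.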
4. Pseudo-labelling
We attributed a pseudo-label in {0, 1} to all visits. This increases the number of visits available to train the final logistic regression and adds visits with a more uncertain phenotype status, which overall implies smoother predicted probabilities and better performance. To perform this pseudo-labelling, we train a random forest with majority-vote aggregation of the trees as:
The model is used to predict the PLij ∈ {0, 1} status for each visit.
5. Probability estimation
To estimate the disease occurrence probability, we used a noising-denoising logistic regression with random intercept, similarly to PheNorm. First, max(10^5, φ) visits are randomly sampled with replacement, with inverse probability weighting depending on PLij in order to balance the training set. This new matrix is denoted Xb. Then we performed a noising-denoising step to force the algorithm to use variables other than the main ICD and UMLS concepts (and thus avoid overfitting with respect to the surrogate). Every value of the explanatory variables has a probability pbern = 0.3 of being replaced by the mean of the corresponding explanatory variable, and this noisy matrix is denoted Xn:
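The noising step described above amounts to drawing, for every cell, an independent Bernoulli(pbern) variable and replacing the cell by its column mean when the draw equals 1. A minimal Python sketch is given below; the function names and the seed argument are hypothetical, and the means are computed on the resampled matrix Xb, which is an assumption.

```python
import random


def column_means(matrix):
    """Mean of each explanatory variable (column) of a row-major matrix."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [sum(row[k] for row in matrix) / n_rows for k in range(n_cols)]


def add_noise(matrix, p_bern=0.3, seed=None):
    """Replace each value by its column mean with probability p_bern."""
    rng = random.Random(seed)
    means = column_means(matrix)
    return [
        [means[k] if rng.random() < p_bern else value
         for k, value in enumerate(row)]
        for row in matrix
    ]
```

Because about 30% of the values of every variable are masked, including those of the main concepts, the subsequent regression cannot rely on the surrogate alone and has to draw on the other selected features.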
For the denoising step a logistic regression with random intercept is used:
And finally the probability of having the disease is estimated on the noise free matrix as:
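The two equations of this step are missing. A form consistent with the description is sketched here under stated assumptions: the random intercept b_i is taken at the patient level with a Gaussian distribution, and the final probability is computed from the fixed effects only, applied to the noise-free matrix X′. Neither assumption is confirmed by the text, and the symbols b_i and σ_b are introduced for this illustration.

```latex
\[
\operatorname{logit} P\big(PL_{ij} = 1 \mid X^{n}_{ij}, b_i\big)
  = \beta_0 + b_i + X^{n}_{ij}\,\beta,
\qquad
b_i \sim \mathcal{N}(0, \sigma_b^2)
\]
\[
\widehat{P}_{ij} = \operatorname{expit}\big(\widehat{\beta}_0 + X'_{ij}\,\widehat{\beta}\big),
\qquad
\operatorname{expit}(x) = \frac{1}{1 + e^{-x}}
\]
```

Because the final predictor is a logistic function of a linear combination of concept counts, each coefficient can be read as a log odds ratio, which is what makes the predictor an interpretable “white box”.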
COMPETING INTERESTS
None.