Abstract
We develop here a data-driven approach for disease recognition based on given symptoms, intended as an efficient tool for anomaly detection. In a clinical setting, when presented with a patient with a combination of traits, a doctor may wonder whether a certain combination of symptoms is especially predictive, for example, “Are fevers more informative in women than men?” According to our analysis, the answer to this question is yes. We develop here a methodology to enumerate such questions and to learn which are the stronger warning signs when attempting to diagnose a disease. We call the underlying measure Conditional Predictive Informativity (CPI), and its ranking CPIR. This simple-to-use process allows us to identify particularly informative combinations of symptoms and traits that may help medical field analysis in general, and that may possibly become a new data-driven advisory approach for individual medical diagnosis, as well as for broader public policy discussion. In particular, we have been motivated to develop this tool by the pressing world crisis due to the COVID-19 pandemic. We apply the methods here to data collected from national, provincial, and municipal health reports, as well as additional information from online sources, curated in a publicly available online GitHub repository.
1 Introduction
As healthcare systems around the world cope with the COVID-19 crisis, concerns about the ongoing spread of the disease remain, in part because of the as yet uncertain size and spreading role of the asymptomatic and low-symptomatic population, and also the true false-negative rate [14] of available tests. Such a clear threat to humanity demands our best informative ability to recognize its traits across populations. Clinical diagnosis remains a major line of defense, especially while direct tests remain uncertain and not widely available.
We develop in this work a Conditionally Predictively Informative Ranking (CPIR) method that adopts data-driven principles of information theory, as an extension of the principle of causation entropy [11], to give a reliable ranking that reflects the direct informativity of each symptom, after considering the underlying relationships between the symptoms. See Fig. 1. In brief, this approach contrasts directly with the commonly used concept of correlation [8, 2, 7, 10], which associates a score to describe how strongly measurements go together, whereas CPIR associates a score to describe how atypical a new observation is, given that a collection of symptoms has already been observed. Often it is the presence of atypical observations that allows the clinician to decide how worrying the constellation of symptoms presenting in a given patient should be considered. Likewise, understanding atypical associations can be especially informative when characterizing a disease across the population.
We define a Conditional Predictive Informativity (CPI) measure in terms of conditional mutual information (I) to describe how much information is gained by learning that a given patient has certain symptoms, given that other stated symptoms have already been observed. For instance, the question, “Is a cough especially informative if a fever has already been observed in a patient?” is stated with the CPI as the conditional mutual information between the cough and the outcome, conditioned on the fever. Referring to the available COVID-19 dataset, and as shown in Fig. 1, it turns out that about 80% of those who have a cough also have a fever. So while correlation picks up that association, it is exactly not the association we wish to identify with CPI.
Referring to the available COVID-19 dataset, we also find that less than 20% of patients who have chills also have a headache. So for the analogous question about chills, the answer is yes: observing chills is informative given a headache. Moreover, chills are crucially informative. As we show in Fig. 1, patients can be divided into typical and atypical patients. Both groups have tested positive for the disease. However, while the majority of patients share the same set of symptoms and reaction to the disease, a subgroup has different reactions that may appear as severe pain and symptoms, and even death. In COVID-19 statistics, we see that more than 80% of patients have mild symptoms, or may even be asymptomatic, while a small group has severe pain and risky symptoms, and the global mortality rate of COVID-19 is less than 7% of confirmed cases.
Even though the number of patients who have chills is much lower than the number who have a fever (see Fig. 1), this symptom turns out to be more informative than fever, meaning that chills appear especially in atypical patients, but not in typical patients. A fever, however, appears in both groups, and furthermore it is highly associated with other diseases. This paper is devoted to developing and describing how to compute these kinds of classifications and conclusions.
Ranking on this conditional measure allows us to learn the markers of the disease. To state one such example outcome, our analysis shows that chills combined with pneumonia is especially informative when it occurs in young women, more so than in other patients. It is a standout collection of symptoms. Another example is that headaches and body aches are highly informative of atypical patients when they appear in men. More results of this type are shown in Fig. 2. With such examples in mind, our methods provide a data-driven way to define an informativity ranking of symptoms toward a reliable prediction of disease presence based on the specific combination of informative symptoms. The knowledge gained when observing a specific combination of symptoms, relative to the further predictability of the status of other symptoms, underlies the ability to interpret observations in the clinical setting.
With the idea of informative collections of symptoms, we then offer the CPIR as a ranking of these informative symptoms, sorted so as to describe those symptoms that, when occurring together, make for unmistakable signs of the disease. Stated another way in terms of an example, a patient who is observed to have a dry cough and a fever is not so indicative of the disease, since the fever and a dry cough so often go together. To take an even more extreme example, after observing a dry cough and the presence of a left foot, furthermore observing the presence of a right foot is not informative as a further observation. However, as we find, a cough may be indicative, but then a fever may be more informative if furthermore the patient is female.
Individual-level epidemiological data from the COVID-19 outbreak are publicly available from the international resource [15]. These are collected from national, provincial, and municipal health reports, as well as additional information from online sources, and these data are collectively curated for general use. The data are continually updated, as described in the paper associated with the GitHub repository. As of April 16, the dataset has more than 267,000 entries of individual-level data. Fig. 2 is a table of the more striking combinations of informative symptoms and traits that we have found in this study; the methods used to rank the CPI are developed below.
2 Methods
In our previous work [1], we introduced the method of Entropic Regression, which adopts the principle of causation entropy [11] for the discovery of underlying dynamics based on the influence of a set of candidate functions on the outcome dynamics. However, in the case of Boolean outcomes and mixed datasets, where the outcome depends on a combination of variables that could be real numbers, Boolean variables, or descriptive data, we must extend the principles of causality inference and minimal description.
2.1 “Booleanization” of Non-Boolean Variables
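Let Y ∈ {0, 1} be the outcome that a patient has a disease, or not, such that, for patient i:
$$ Y_i = \begin{cases} 1, & \text{if patient } i \text{ has the disease}, \\ 0, & \text{otherwise}. \end{cases} \tag{1} $$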
The patient data, or the variables that influence the outcome, are mostly mixed variables with different data types. See Fig. 3. Further, define the ith patient variables to be
$$ x_i = [\, s_{i,1}, \dots, s_{i,n_s},\; b_{i,1}, \dots, b_{i,n_b},\; r_{i,1}, \dots, r_{i,n_r} \,], \tag{2} $$
(i.e. a row in the table shown in Fig. 3). Here, $s_{i,1}, \dots, s_{i,n_s}$ are symbolic variables, each of which has its own independent set of labels (symbols, categories, etc.) $\{l_1, \dots, l_{K_j}\}$, and ns is the number of symbolic variables. Similarly, $b_{i,1}, \dots, b_{i,n_b}$ are Boolean variables, and nb is the number of Boolean variables. Finally, $r_{i,1}, \dots, r_{i,n_r}$ are real-valued variables, and nr is the number of real-valued variables.
For example, and as a preliminary introduction to our approach, assume that the variable s:,2 is an N-dimensional vector, with N the number of observations, whose entries are the diagnosis states of the symptoms of each patient. Symptoms are labeled; for example, l1 represents presence of a fever, l2 represents presence of a cough, and l3 represents a headache, or possibly even the degree of these. Here we illustrate with an example in which K2 = 3 is the number of labeled descriptions, or the number of symptoms that will be observed in patients. For the sake of clarity we drop the subscript and write the vector of symptoms for all patients as s. For a given patient i who, for example, has a fever and a headache, we write the entry si = {l1, l3}, representing occurrence of a fever and a headache. Then, we can translate the descriptive vector variable s to a multi-vector (matrix, see Fig. 3) Boolean variable S, such that:
$$ S = \text{ß}(s), \qquad S_{i,k} = \begin{cases} 1, & l_k \in s_i, \\ 0, & \text{otherwise}. \end{cases} \tag{3} $$
So, the function ß(s) converts the column vector s of N descriptive entries to an N × K Boolean matrix S, where the kth column, k = 1, …, K, of the Boolean matrix describes the occurrence of the labeled description lk in s. Fig. 3 shows a schematic illustration of this local “Booleanization” process. With the variables Booleanized, we describe in the next section how to construct a conditionally predictively informative ranking (CPIR) for the symptoms.
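As an illustration of the Booleanization map, the following is a minimal Python sketch, assuming each patient entry is stored as a set of label strings. The function name `booleanize` and the three-patient data are hypothetical and are not taken from the authors' Matlab code.

```python
import numpy as np

def booleanize(s, labels):
    """Map N label sets to an N x K Boolean matrix S,
    with S[i, k] = True exactly when labels[k] occurs in s[i]."""
    S = np.zeros((len(s), len(labels)), dtype=bool)
    for i, entry in enumerate(s):
        for k, label in enumerate(labels):
            S[i, k] = label in entry
    return S

# Hypothetical data: K = 3 labels (l1 fever, l2 cough, l3 headache), N = 3 patients.
labels = ["fever", "cough", "headache"]
s = [{"fever", "headache"}, {"cough"}, set()]

S = booleanize(s, labels)
print(S.astype(int))
```

The first row corresponds to the entry si = {l1, l3} used in the text, so its first and third columns are set.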
It is clear that real-valued variables can be converted to a set of Boolean variables by thresholding; the thresholded values are then translated to labels and then to a Boolean matrix, as discussed above.
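A minimal sketch of the thresholding step follows, assuming a simple "greater than cutoff" rule that maps directly to Boolean columns and skips the intermediate label step. The cutoffs, the age values, and the function name `threshold_to_boolean` are hypothetical; the text does not specify which thresholds are used.

```python
import numpy as np

def threshold_to_boolean(r, thresholds):
    """Return an N x T Boolean matrix whose column j is True where r > thresholds[j]."""
    r = np.asarray(r, dtype=float).reshape(-1, 1)           # N x 1
    t = np.asarray(thresholds, dtype=float).reshape(1, -1)  # 1 x T
    return r > t                                            # broadcasts to N x T

# Hypothetical real-valued variable (age in years) and hypothetical cutoffs.
ages = [23, 41, 67, 80]
cutoffs = [40, 65]

B = threshold_to_boolean(ages, cutoffs)
print(B.astype(int))
```

Each resulting column can then be treated like any other Boolean variable in the symptom matrix.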
2.2 Conditionally Predictive Information
Now we describe how certain Boolean factors may be predictive of other factors, given the status of yet other factors. This is a variation on the theme of our previous work in causation entropy [3, 11, 12, 13], but here given the nature of the expected data and the corresponding difference of the underlying questions, the role of time is less relevant than the notion of indicative and predictive. Our goal here is to define an informativity ranking of symptoms toward reliable prediction of disease presence based on the combination of informative symptoms.
The Conditionally Predictive Information (CPI), described here as the conditional mutual information between one Boolean variable Sk and the outcome Y, is given by:
$$ \mathrm{CPI}_k = I(S_k; Y \mid S_{k^*}), \tag{4} $$
where k∗ is the set of Boolean variable indices {1, …, K} − {k}. That is, the CPI is the mutual information between the kth Boolean variable and the outcome Y, conditioned on all other Boolean variables. While this is therefore just a conditional mutual information [5], it is the way the conditioning set is designed, and how the results are ranked, that make this approach especially relevant to our needs here. Thus, it is a measure of the knowledge one gains in observing some of the variables toward predicting the status of other variables, which is crucial when the goal is to uncover a minimal and most informative set of variables.
Eq. 4 quantifies the information added by observation of the Boolean variable Sk, given the state of all other symptoms. However, this does not inform the directionality of influence between Sk and the outcome Y. For example, the Boolean variables a = [111000] and b = [000111] both have high mutual information with the outcome y = [111000]. However, while the event occurrence in y is associated with the event occurrence in a, the event occurrence in y is associated with non-occurrence of the event in b. Then, the mutual information I(a; y) = I(b; y) may result, but this reveals no information about the directionality of the relationship. To address this issue, we re-write Eq. 4 as:
$$ \mathrm{CPI}_k = \Gamma(S_k, Y)\, I(S_k; Y \mid S_{k^*}), \tag{5} $$
where Γ is a sign function that is given by:
$$ \Gamma(S_k, Y) = \begin{cases} 1, & \Pr(Y \mid S_k) > \Pr(Y \mid \neg S_k), \\ -1, & \text{otherwise}, \end{cases} \tag{6} $$
where Pr(Y|Sk) is the conditional probability of Y given Sk, and (¬) is the logical NOT operator of a Boolean variable. That is, if the occurrence of (Y = 1) is associated with the occurrence of (Sk = 1) more than it is associated with Sk = 0, then this describes a positive relationship. Otherwise it is a negative relationship, in which the occurrence of Y is associated with non-occurrence of Sk.
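The example with a, b, and y can be checked numerically. The sketch below is only an illustration under stated assumptions: probabilities are estimated by relative counts, the logarithm is base 2 (the text does not fix the base), and the helper names `entropy`, `mutual_information`, and `gamma` are hypothetical.

```python
import numpy as np

def entropy(columns):
    """Shannon entropy in bits of the joint states (unique rows) of the given columns."""
    rows = np.column_stack(columns).astype(int)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy([x]) + entropy([y]) - entropy([x, y])

def gamma(s_k, y):
    """Sign function of Eq. 6: +1 if Pr(Y=1 | S_k=1) > Pr(Y=1 | S_k=0), otherwise -1."""
    s_k = np.asarray(s_k, dtype=bool)
    y = np.asarray(y, dtype=bool)
    p_given_1 = y[s_k].mean() if s_k.any() else 0.0
    p_given_0 = y[~s_k].mean() if (~s_k).any() else 0.0
    return 1 if p_given_1 > p_given_0 else -1

a = [1, 1, 1, 0, 0, 0]
b = [0, 0, 0, 1, 1, 1]
y = [1, 1, 1, 0, 0, 0]

print(mutual_information(a, y), gamma(a, y))
print(mutual_information(b, y), gamma(b, y))
```

Both variables carry the same mutual information with y, and only the sign returned by `gamma` separates the positive association of a from the negative association of b.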
Here, we want to emphasize a major difference between our approach and the commonly used methods from statistical analysis by correlations from data. Statistical analysis based on correlations [8, 2, 7, 10] has a different goal than we have here. Correlations describe variables and outcomes in terms of probability of occurrence, but it is well known that correlation does not imply causation [9]. For example, suppose that 80% of the patients (Y = 1) have a fever, and 50% of them have a cough. Then, fever and cough are the dominant symptoms. However, this analysis neglects the correlation among the symptoms themselves, but most important to our interests here, it does not reveal if the symptoms are informative or not. See Fig. 4 for a detailed example. We are less interested in the cause in this work, and instead, we are interested in developing an Informativity Rank (IR) of the symptoms. That is, for a set of symptoms, what are the symptoms that are Conditionally Predictively Informative (CPI) of an underlying condition such as the presence of a disease.
Assume an extreme example for discussion, in which 99% of the patients who have a cough also have a fever. Then clearly there is (almost) no further useful information provided by observing the cough that was not already provided by observing the fever. Thus, the cough on its own cannot stand as an informative symptom if not combined with the fever. Seeking informative variables is our main objective, and the conditional mutual information, carefully conditioned as stated and with the optimized conditioning set, can reveal this informative set by discovering the set of direct influences between the variables and the outcomes. In other words, finding the conditional mutual information in Eq. 4 translates to asking, “What is the direct influence (information) provided by the symptom Sk, given the information from all other symptoms, Sk∗?” By this local analysis, we infer which variables are dominant, that is, most informative. Furthermore, we will show this analysis in terms of the COVID-19 dataset discussed above. First, however, we must explain an important issue associated with uni-state outcomes.
2.3 Uni-State Outcome Variable
In the available COVID-19 data, we must cope with a common and well-known problem called selection bias [4]. This problem manifests itself here in that we generally only have the data acquired from actual diseased patients, meaning that we only have the data for the people who have already tested positive. We call this the problem of a uni-state outcome variable, Y = 1, for all the patients.
Considering the COVID-19 dataset, we found about 26 commonly reported symptoms (fever, cough, etc.), with the asymptomatic case included among the possible entries. This means that for each patient i, we have a binary string Si of 26 bits that represents the symptoms that the patient has. For simplicity of discussion, we consider an example of 4 symptoms ordered as follows: {Asymptomatic, Fever, Cough, Abdominal pain}, but of course we use all the available data in our analysis. If the symptoms of the patient are not reported, then we have Si = [0, 0, 0, 0]. If it is reported that the patient has no symptoms, we have Si = [1, 0, 0, 0]. If the patient has a fever and a cough, we have Si = [0, 1, 1, 0], and so on.
For a 26-bit binary string, there are up to 2²⁶, or more than 67 million, possible combinations of symptoms. However, we found that in the actual data set only nu = 124 unique binary strings Su were in fact expressed for the symptoms observed. In Fig. 5, we show the histogram of Su derived from the data. Observe that not only do we have a low count of unique states but, moreover, most of the unique states have a very low frequency of occurrence, F = 1 or F = 2. Low-frequency events, associated with low probability or otherwise rare events, may be especially interesting to study individually, as these outliers may in fact be specifically informative. However, our intention here is to rank the symptoms based on their informativity of the disease for the majority of patients.
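The counting described above can be sketched as follows, using the 4-symptom toy ordering from the text. The six patient records are hypothetical and serve only to show how the unique states Su and their frequencies F can be tallied.

```python
from collections import Counter

# Size of the state space for 26 Boolean symptoms.
n_symptoms = 26
n_possible = 2 ** n_symptoms
print(n_possible)

# Hypothetical records ordered as {Asymptomatic, Fever, Cough, Abdominal pain}.
records = [
    (0, 1, 1, 0),  # fever and cough
    (0, 1, 1, 0),
    (0, 1, 0, 0),  # fever only
    (1, 0, 0, 0),  # reported as asymptomatic
    (0, 1, 1, 0),
    (0, 0, 0, 1),  # abdominal pain only
]

freq = Counter(records)   # maps each unique binary string to its frequency F
n_u = len(freq)           # number of unique states actually observed
print(n_u, freq.most_common(1))
```

In this toy sample only 4 of the 16 possible 4-bit states occur, which mirrors, at a small scale, the gap between 2²⁶ possible states and the 124 observed ones.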
Note that we seek to infer the CPI that describes the outcome of most patients, and hence we assume that the outcome Y is equal to 0 if the patient has a binary string Si that has low probability. Mathematically we write:
$$ Y_i = \begin{cases} 1, & \Pr(S_i) > \delta, \\ 0, & \text{otherwise}, \end{cases} \tag{7} $$
where Pr(Si) is the probability of occurrence of the binary string Si, and δ is a probability tolerance. In order to choose δ, assume that F is the ordered frequency (histogram) of the unique set of symptoms Su, such that $F_1 \le F_2 \le \dots \le F_{n_u}$. To investigate how each state in Su provides additional information, we analyze the entropy of the histogram sequentially. Let Ei be the entropy of the probability distribution of the first i frequency entries, given by:
$$ E_i = -\sum_{j=1}^{i} p_j^{(i)} \log p_j^{(i)}, \qquad p_j^{(i)} = \frac{F_j}{\sum_{m=1}^{i} F_m}, \tag{8} $$
where $p_j^{(i)}$ represents the probability of the jth entry in F under the assumption that only the first i states are available. In other words, at each i we ask: what is the entropy of Su if we assume that we have only the first i states? This allows us to track, starting from the low-frequency states, how the high-frequency states affect the information (entropy) of Su. Fig. 6 shows the entropy curve from Eq. 8.
It will be a subject of our future work to connect this approach to the theory stemming from the asymptotic equipartition property (AEP) [5] and, in so doing, to discuss the optimality of the value of δ, whereby in analogy to the AEP we classify the patients as associated with typical and atypical sets. Thus, most of the information is interpreted to be associated with the typical set of patients. For our current analysis, we consider $\delta = F^{*}/N$, where $F^{*}$ is the frequency at the maximum entropy, and N is the sample size. For our dataset, we found δ = 2.69 × 10⁻⁵.
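The selection of δ and the resulting outcome vector can be sketched as below. This is a toy illustration under several assumptions that the text does not settle: frequencies are sorted in ascending order, the logarithm is base 2, δ is the frequency at the maximum of the entropy curve divided by the sample size, and Y = 1 requires a probability strictly greater than δ. The records and the function name `entropy_curve` are hypothetical.

```python
import numpy as np
from collections import Counter

def entropy_curve(freqs_ascending):
    """E[i-1] is the entropy of the first i frequencies after renormalizing them to sum to one."""
    F = np.asarray(freqs_ascending, dtype=float)
    E = np.empty(len(F))
    for i in range(1, len(F) + 1):
        p = F[:i] / F[:i].sum()
        E[i - 1] = -(p * np.log2(p)).sum()
    return E

# Hypothetical sample: one common state seen 20 times and four states seen once each.
records = [(0, 1, 1, 0)] * 20 + [(0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), (0, 1, 1, 1)]
N = len(records)

freq = Counter(records)
states = sorted(freq, key=freq.get)      # unique states in ascending frequency
F = [freq[state] for state in states]

E = entropy_curve(F)
i_star = int(np.argmax(E))               # position of the maximum entropy
delta = F[i_star] / N                    # assumed rule: frequency at maximum entropy over N

# Outcome: 1 for patients in sufficiently probable (typical) states, 0 otherwise.
Y = np.array([freq[r] / N > delta for r in records], dtype=int)
print(delta, int(Y.sum()))
```

In this toy sample the entropy grows while only the rare states are included and drops once the dominant state enters, so the maximum sits at the last rare state and the 20 patients in the common state form the typical set.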
Now, given δ, we have the outcome Y from Eq. 7, and the set of Boolean variables S that describes the symptoms. We apply Eq. 5 to find the conditional mutual information (CMI) of each symptom. The CPIR is the scaled CMI, with magnitude in [0, 1], which indicates the rank of each symptom; it is shown in Fig. 7. In Fig. 8, we show the CMI between the symptoms themselves, that is, the CMI between each pair of symptoms given the information from all others. This mutual information between the symptoms in Fig. 8 can in practice lead to a fuzzy classification of the critical symptoms, since it is hidden from the commonly used statistical techniques.
2.4 Computational Approach
The conditional mutual information associated with the CPI, Eq. (4), requires that we review the conditional mutual information [5] and remark on how the associated probabilities may be estimated. Let X, Y, and Z be jointly distributed random variables with joint probability distribution p(x, y, z). The conditional mutual information between X and Y given Z, as used in Eq. 4, is given by:
$$ I(X; Y \mid Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z), \tag{9} $$
where H is the entropy function, and H(·,·) is the joint entropy of two variables (and likewise H(·,·,·) of three). The entropy of a discrete random variable X with pmf pX(x) and n possible states is given by
$$ H(X) = -\sum_{i=1}^{n} p_X(x_i) \log p_X(x_i). \tag{10} $$
For a Boolean random variable X, let p be the probability that X = 1. We do not emphasize here how to efficiently estimate p, other than to note that typically this may be done by counts of relative occurrence. Then the probability that X = 0 is 1 − p, and the entropy of X is given by:
$$ H(X) = -p \log p - (1 - p) \log(1 - p). \tag{11} $$
For the joint entropy of discrete variables in Eq. 9, we can think of the joint variable XZ, for example, as a new variable with a two-dimensional outcome state (the concatenation of the two variables as two columns), which we call the joint space of X and Z. Then, the entropy can be found from the probability distribution of the unique states in XZ (unique rows), using Eq. 10.
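The unique-rows estimate of the joint entropy and the resulting conditional mutual information can be sketched as follows. This is a minimal illustration rather than the authors' implementation: probabilities are estimated by relative counts, the logarithm is base 2 by assumption, and the function names and the eight-patient data are hypothetical. The data take the cough-and-fever redundancy to its limit, with the cough column an exact copy of the fever column.

```python
import numpy as np

def joint_entropy(*columns):
    """Entropy in bits of the joint variable whose states are the unique rows of the stacked columns."""
    rows = np.column_stack(columns).astype(int)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_mutual_information(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z); z may hold several columns."""
    return (joint_entropy(x, z) + joint_entropy(y, z)
            - joint_entropy(x, y, z) - joint_entropy(z))

# Hypothetical data: the outcome is related to fever, and cough duplicates fever.
fever = np.array([1, 1, 1, 1, 0, 0, 0, 0])
cough = fever.copy()
outcome = np.array([1, 1, 1, 0, 0, 0, 0, 1])

mi = joint_entropy(cough) + joint_entropy(outcome) - joint_entropy(cough, outcome)
cmi = conditional_mutual_information(cough, outcome, fever)
print(round(mi, 4), round(cmi, 4))
```

Taken alone, the cough shares information with the outcome, but once the fever is in the conditioning set the cough adds nothing, which is the distinction between association and direct informativity drawn in the text.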
Now, we can algorithmically obtain our conditionally predictively informative ranking (CPIR) by the following loop through all symptoms: for each symptom i = 1, …, K, compute CPIi from Eq. 5. The ranking CPIR is then given by CPIRi = CPIi / max(|CPI|). In Fig. 7, we plot the absolute value of CPIR sorted in descending order, which makes the figure clearer and more readable.
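The loop over symptoms can be sketched in Python as below. This is an illustrative reading of Eqs. 5 and 6 and of the scaling by max(|CPI|), not the authors' Matlab code: entropies use relative counts with a base-2 logarithm, at least two symptom columns are assumed, and the function names and the toy data are hypothetical.

```python
import itertools
import numpy as np

def joint_entropy(*columns):
    """Entropy in bits over the unique rows of the stacked columns."""
    rows = np.column_stack(columns).astype(int)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def cmi(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return joint_entropy(x, z) + joint_entropy(y, z) - joint_entropy(x, y, z) - joint_entropy(z)

def gamma(s_k, y):
    """+1 if Pr(Y=1 | S_k=1) > Pr(Y=1 | S_k=0), otherwise -1."""
    s_k = np.asarray(s_k, dtype=bool)
    y = np.asarray(y, dtype=bool)
    p1 = y[s_k].mean() if s_k.any() else 0.0
    p0 = y[~s_k].mean() if (~s_k).any() else 0.0
    return 1 if p1 > p0 else -1

def cpir(S, Y):
    """Signed CPI of every column of S (assumes K >= 2), scaled by the largest magnitude."""
    S = np.asarray(S).astype(int)
    Y = np.asarray(Y).astype(int)
    cpi = np.zeros(S.shape[1])
    for k in range(S.shape[1]):
        rest = np.delete(S, k, axis=1)   # conditioning set: all other symptoms
        cpi[k] = gamma(S[:, k], Y) * cmi(S[:, k], Y, rest)
    scale = np.abs(cpi).max()
    return cpi / scale if scale > 0 else cpi

# Toy data: all 8 combinations of 3 symptoms; the outcome is 1 when symptom 0 is
# present and symptom 2 is absent, while symptom 1 is irrelevant.
S = np.array(list(itertools.product([0, 1], repeat=3)))
Y = S[:, 0] & (1 - S[:, 2])

ranking = cpir(S, Y)
print(np.round(ranking, 3))
```

In this toy case the first symptom receives the full positive score, the third an equally large negative score, and the irrelevant second symptom a score of zero; sorting the absolute values in descending order gives the kind of ranking plotted in Fig. 7.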
3 Results, Symptoms and Informativity Questions
Now we are in a position to answer very simple but important questions, such as, “Are fevers more informative in women than men?” Clearly there are many comparably important and simple-to-state questions that become clinically relevant when a doctor is presented with a specific patient presenting specific symptoms. One interesting question to ask is whether the symptoms have a different CPIR depending on a specific demographic variable such as sex or age. There are different ways to answer this question with the conditional mutual information. We may, for example, add the Boolean vector of gender to the conditioning set, and track the reduction of CPIR for each symptom. However, another approach, which overcomes the need to increase the size of the conditioning set, is to replace the outcome set with the gender vector.
Let Yi,1 = 1 if the patient i is female, and Yi,1 = 0 otherwise. Similarly, let Yi,2 = 1 if the patient i is male, and Yi,2 = 0 otherwise. Note that due to missing data and other factors, Y:,1 ≠ ¬Y:,2, where (¬) is the logical operator (NOT). Then, we repeat the process for Y:,1 and Y:,2, using the symptoms matrix S and Eq. 5. Fig. 9 shows the results for the demographic informativity of the symptoms. Given δ, we have the outcome Y from Eq. 7, and the set of Boolean variables S that describes the symptoms. We apply Eq. 5 to find the CPI of each symptom. The CPIR is the scaled CPI, as discussed in the computations section, which indicates the rank of each symptom, and it is shown in Fig. 7.
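The construction of the two demographic outcome vectors can be sketched as follows, assuming the sex field is stored as a string with missing entries represented by None. The six records are hypothetical; the point is only that the two vectors are not logical complements when some entries are missing.

```python
import numpy as np

# Hypothetical sex column with missing entries.
sex = ["female", "male", None, "female", "male", None]

Y1 = np.array([value == "female" for value in sex], dtype=int)  # Y[:, 1] in the text
Y2 = np.array([value == "male" for value in sex], dtype=int)    # Y[:, 2] in the text

# With missing data, Y1 is not the logical NOT of Y2.
complementary = bool(np.array_equal(Y1, 1 - Y2))
print(Y1, Y2, complementary)
```

Each vector would then take the place of the outcome Y when the CPI of every symptom is recomputed.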
Fever and cough are widely known as the main symptoms of COVID-19, and in our dataset we found that 75% of patients have a fever, and 45% of them have a cough. However, Fig. 7 shows that the most informative symptom is chills, and that it is specifically informative that the patient is from the atypical set of patients. Fever placed as the second most informative symptom of the typical patients, while the most informative symptom of the typical patients was the group of respiratory symptoms, which includes respiratory infection, acute respiratory viral infection (ARVI), and acute respiratory distress syndrome (ARDS). Clinically, breathing difficulty can be listed under respiratory symptoms. Analyzing breathing difficulty individually, we found it to be the second most informative symptom of atypical patients.
Clinically, fatigue and weakness are two different symptoms: weakness is defined as a failure to generate the required or expected force on first testing or attempted performance, and fatigue is defined as a failure to generate the required or expected force during sustained or repeated contraction [6]. Both are informative symptoms of COVID-19; however, weakness is more critical, since it is informative of atypical patients. Interestingly, since the two terms are often mistakenly used interchangeably, especially by patients when describing them, if we consider them to be equivalent, say by labeling both as fatigue, then the combined symptom becomes the most informative among all of the symptoms of COVID-19.
In Fig. 8, we show the CMI computed between pairs of symptoms, given the information from all others. This mutual information between the symptoms in Fig. 8 can in practice lead to a fuzzy classification of the critical symptoms, since it is hidden from the commonly used statistical techniques. For example, in a statistical sense, we say that fever and cough are the most common symptoms to appear in COVID-19 patients, but this analysis lacks any consideration of the dependency between cough and fever, and of the fact that most of the patients who have a cough already have a fever. Fig. 8 shows this dependency as high direct information shared between cough and fever. In our CPI approach, addressing these interactions between symptoms is embedded in Eq. 5, and we obtain the direct informativity between each individual symptom and the outcome.
To the question, “Are fevers more informative in women than men?”, the answer according to the CPIR is yes. In Fig. 9, we see that while fever is informative for the typical male patients, it is associated with the atypical female patients. To read Fig. 9 fully and correctly, we say that fever is an informative symptom of atypical patients among women, which means that it is more surprising to see women who have a fever, and hence it can be a more critical or risky symptom, to be taken especially seriously when it appears in women.
We must note that Fig. 9 helps in recognizing differences in symptom informativity between men and women; it does not replace the general-population discussion of Fig. 7, but rather complements it. For example, we see that chills, which is the most informative symptom in Fig. 7, has zero CPIR in men and women, but that does not mean it is not informative. Instead, it means that chills show no difference in informativity between men and women. Finally, in analogy to Fig. 9, we show a comparable set of results for symptom informativity in younger and older patients.
As a summary of our results querying for the most striking combinations of symptoms and traits, see the tabular summary in Fig. 2.
4 Discussion
In this paper we introduced a Conditionally Predictively Informative Ranking (CPIR) approach. In particular, we used this method to analyze COVID-19 symptoms and to give a ranking of informative symptoms. In analogy to the symptoms example, other descriptive (labeled) variables can be analyzed to extract the informative descriptions in each variable that contains labeled descriptions. In our future work, we will extend the idea to consider symptom-disease causality-driven networks, to construct informative networks that can give a signature of symptom combinations. We hope this can be helpful as a data-driven approach for disease recognition based on given symptoms, and that it can be an efficient tool for anomaly detection.
The results of applying the presented mathematical methods to the data are shown in Fig. 7, Fig. 9 and Fig. 10. Summarizing from these figures, we highlight in Fig. 2 our main, and perhaps most striking, combinations of symptoms and traits. While we have highlighted COVID-19 in this discussion, for the obvious critical nature of this crisis, it is our hope that this tool, in particular the CPIR, may find utility in the analysis of other diseases, and indeed for other questions of medical, social, and scientific importance.
5 Code and Data Availability
Our results are based on the dataset [15], which is continually updated online. We will update our results as new data become available, and the updated results will be posted in our online COVID-19 repository, together with the Matlab code to process the data and perform the CPIR analysis.
6 Conflict of Interest
The authors declare no conflict of interest.
7 Acknowledgments
This work was funded in part by the Army Research Office, and also DARPA.