Abstract
Multiplex panel tests identify many individual pathogens at once, using a set of component tests. In some panels the number of components can be large. If the panel is detecting causative pathogens for a single syndrome or disease then we might estimate the burden of that disease by combining the results of the panel, for example determining the prevalence of pneumococcal pneumonia as caused by many individual pneumococcal serotypes. When we are dealing with multiplex test panels with many components, test error in the individual components of a panel, even when present at very low levels, can cause significant overall error. Uncertainty in the sensitivity and specificity of the individual tests, and statistical fluctuations in the numbers of false positives and false negatives, will cause large uncertainty in the combined estimates of disease prevalence. In many cases this can be a source of significant bias. In this paper we develop a mathematical framework to characterise this issue, present novel statistical methods that adjust for this bias and quantify uncertainty, and use simulation to test these methods. As multiplex testing becomes more commonly used for screening in routine clinical practice, accumulation of test error due to the combination of large numbers of test results needs to be identified and corrected for.
Author summary
During analysis of pneumococcal incidence data obtained from serotype specific multiplex urine antigen testing, we identified that despite excellent test sensitivity and specificity, the small error rate in each individual serotype test has the potential to compound and cause large uncertainty in the resulting estimates of pneumococcal prevalence, obtained by combining individual results. This limits the accuracy of estimates of the burden of disease caused by vaccine preventable pneumococcal serotypes, and in certain situations can produce marked bias.
Introduction
Multiplex panel testing is a convenient and rapid diagnostic approach and is increasingly being used in clinical practice to differentiate between viral and bacterial causes of a range of disorders [1]. It has also been used in epidemiological studies to identify pneumococcal subtypes targeted by vaccines [2] or monitor disease spread [3]. Multiplex panel tests have been developed for a wide range of clinical syndromes caused by different pathogens, or for specific diseases caused by different subtypes of the same pathogen [1], and may be based on immunological [4, 5] or genetic techniques [6–11]. The number of targets in each multiplex panel is increasing, and currently ranges from a handful up to 48 different causative agents [3]. In this paper we demonstrate that when large multiplex panels are used, even small errors in the component tests can cause significant compound error and potential bias if the results are combined, usually leading to an overestimate of the prevalence of the combined condition.
In the schematic in Fig 1, we distinguish between multiplex testing (subfigures A-D) and other types of multiple testing (subfigures E-H). Subfigures A-D show two component tests, each of which identifies one of two subtypes of disease. The disease subtypes are present independently of each other and the disease super-type is present if any of the subtypes is present (B-C). In subfigure A we see that a false positive in one component results in a false positive in the combined panel. In subfigure B one subtype is correctly detected, in C the other subtype, and in subfigure D a false positive result for one subtype and a false negative for the other results in an overall result which is correct for the wrong reason. In all subfigures A-D, the combined test result would be interpreted as positive. As described above, this design of test is usually extended to many more than two subtypes to make a multiplex panel.
Fig 1 E-H shows a different test design which is more related to multiple modalities of testing [12]. In this situation, the multiple tests are looking for the same underlying cause of disease which does not have subtypes. In Fig 1 E, both tests are true negatives and the overall result is also a true negative. The interpretation of the two tests can be: a) that any single test being positive implies disease, in which case all subfigures F-H show positive combined results, or b) that both tests must be positive to identify the disease, in which case only subfigure H represents a positive result. These are not regarded as multiplex tests.
In more formal language, we define a multiplex test as consisting of a set of independent components which test different independent hypotheses, the results of which are combined to give a panel result where a positive test result in any component implies a positive test result in the panel. From this point, only multiplex panel tests will be discussed.
If a condition is composed of many subtypes, then each individual subtype must be a fraction of the overall condition prevalence. The more subtypes in a multiplex panel, the smaller that fraction will be on average. If the prevalence of each component is low, then each component test is operating at a level where the positive predictive value of the test (i.e. the probability that a positive test result represents a true positive rather than a false positive) is also relatively low. This leads to a high probability that any given positive component result is a false positive. We will also observe false negatives depending on the sensitivity of the test, but if the prevalence of a subtype is low, there are fewer true positives to be missed.
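The dependence of positive predictive value on prevalence can be illustrated with a short calculation. The sketch below applies Bayes' theorem using the sensitivity (80%) and specificity (99.75%) assumed elsewhere in this paper; the function name and the choice of prevalence values are illustrative assumptions, not part of the published analysis.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive result is a true positive (Bayes' theorem)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    if true_pos + false_pos == 0.0:
        return float("nan")  # no positive results are expected at all
    return true_pos / (true_pos + false_pos)


# Illustrative prevalence values: a whole-panel level and two component levels.
for prevalence in (0.10, 0.02, 0.005):
    ppv = positive_predictive_value(prevalence, sensitivity=0.80, specificity=0.9975)
    print(f"prevalence {prevalence:.3f}: PPV = {ppv:.3f}")
```

The positive predictive value falls as prevalence falls, even though the test characteristics are unchanged, which is why rare components are the most exposed to false positives.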
The effect of this can be seen in Fig 2 where we look at the theoretical distribution of false negatives and false positives in 1000 tests for three hypothetical disease subtypes, present at 2%, 0.5% and 0% prevalence, assuming a test with high specificity of 99.75% and moderate sensitivity of 80%. At 2% prevalence, false positive test results are likely to be outweighed by the false negatives (Fig 2 A) and the test positivity is expected to be lower than 2%, the true value of prevalence in this simulation (Fig 2 B and C). When the prevalence of the subtype is lower, at 0.5%, this pattern is reversed, and the false positives will tend to outweigh the false negatives (Fig 2 D), leading to a higher test positivity than prevalence (Fig 2 E and F). In the 0% scenario (Fig 2 G, H and I) all positives are by definition false positives, distributed with high variance, leading to a test positivity above 0.
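The expected values behind these three scenarios can be worked through directly. The following sketch computes the expected number of false negatives and false positives, and the expected test positivity, for 1000 tests at each prevalence; it gives expectations only, not the full distributions shown in Fig 2, and the function name is an illustrative assumption.

```python
def expected_errors(n_tests, prevalence, sensitivity, specificity):
    """Expected false negatives, false positives and test positivity."""
    diseased = n_tests * prevalence
    disease_free = n_tests - diseased
    false_neg = diseased * (1.0 - sensitivity)
    false_pos = disease_free * (1.0 - specificity)
    true_pos = diseased * sensitivity
    positivity = (true_pos + false_pos) / n_tests
    return false_neg, false_pos, positivity


for prevalence in (0.02, 0.005, 0.0):
    fn, fp, ap = expected_errors(1000, prevalence, sensitivity=0.80, specificity=0.9975)
    print(f"prevalence {prevalence:.3%}: expected FN {fn:.2f}, "
          f"expected FP {fp:.2f}, expected positivity {ap:.3%}")
```

At 2% prevalence the expected false negatives exceed the expected false positives, and the ordering reverses at 0.5% and 0%.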
If a multiplex panel which consists of 20 subtypes is applied to a disease which is present at a prevalence of 10%, then it is reasonable to expect that the three patterns in Fig 2 will be present in some combination. The components have a mix of false positives and false negatives, in a manner dependent on the distribution of disease subtypes. In this particular scenario (20 highly specific tests at 10% prevalence) the balance of these will be towards false positives. Because any positive component results in a positive panel result, the component false positive errors compound in combination. In this example the error combines in such a way that the panel result will contain more false positives than false negatives, and the resulting test positivity rate will be an overestimate of true prevalence.
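The compounding of false positives can be seen in the simplest case of a patient who has none of the subtypes. As a minimal sketch, assuming independent components that all share the same specificity (a simplification made for illustration), the probability of at least one false positive component, and hence a false positive panel result, is one minus the product of the component specificities.

```python
def panel_false_positive_rate(component_specificity, n_components):
    """Probability that a disease-free patient has at least one positive component.

    Assumes independent components with identical specificity.
    """
    return 1.0 - component_specificity ** n_components


for n_components in (1, 5, 10, 20):
    rate = panel_false_positive_rate(0.9975, n_components)
    print(f"{n_components:2d} components: panel false positive rate = {rate:.2%}")
```

A component false positive rate of 0.25% therefore grows to nearly 5% across 20 components, which is substantial relative to a true panel prevalence of 10%.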
The compounding of error in numerous components is analogous to parallel testing of multiple statistical hypotheses. In this situation, a Bonferroni correction is often used to reduce the risk of over-interpreting the results of statistical tests of significance [13]. In a similar way, results from parallel testing of disease sub-types are at risk of being over-interpreted without a clear understanding of the nature of test errors.
In the remainder of this paper we quantify this risk, and summarise the mathematical properties of multiplex tests. We use a realistic simulation based on the example of pneumococcal serotypes to demonstrate the implications and study potential mitigation strategies. In S1 Appendix we provide the detail of the mathematical analysis, and validate our findings against a broad range of simulation scenarios. In S2 Appendix we provide specific detail on propagation of uncertainty associated with combined multiplex panel testing, and validate this against a set of realistic simulations. Supporting implementations of all methods described here are provided in S3 R package.
Materials and methods
In this section we describe the mathematical analysis, the methods used to adjust for potential bias and uncertainty, and the simulations used to test and illustrate the problem. The majority of the detailed methods are found in S1 Appendix and S2 Appendix. The equations presented here are for ease of reference and are not essential to the remainder of the analysis presented in this summary paper.
Mathematical analysis and validation
Given a set of N multiplex panel component tests, the combined test result is defined as positive if any of the panel component tests are positive. For a specific patient k this is represented by the following expression, where I is an indicator function and O is observed test positivity:

$$O_{N,k} = I\left(\sum_{n=1}^{N} O_{n,k} > 0\right) \qquad (1)$$

The test positivity rate (or apparent prevalence, $AP_N$) for the panel result of N tests for a group of K patients is given by:

$$AP_N = \frac{1}{K} \sum_{k=1}^{K} O_{N,k} \qquad (2)$$

A panel result is positive if any component result is positive, and in S1 Appendix we show that a true negative panel result can only be the result of a combination of true negative component results. From this we go on to determine expressions for the sensitivity and specificity of combined panels (Eq 3 and 4, derived in S1 Appendix). In Eq 3 and 4, $AP_n$ is the apparent prevalence (test positivity rate) for the component tests, $sens_n$ and $sens_N$ are the sensitivities of the components and the combined panel, and $spec_n$ and $spec_N$ are the corresponding specificities. From this, we use the Rogan-Gladen estimator of true prevalence [14] to derive expressions for the true prevalence of a combined panel based on the test positivity, sensitivity and specificity of the components. In S1 Appendix these estimators are demonstrated to perform well in a broad range of scenarios based on randomly generated synthetic multiplex panels, and the behaviour of these estimators is analysed in detail.
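For orientation, the relationships described above can be sketched as follows. This is a reconstruction under the stated assumption of independent components, not a copy of the expressions in S1 Appendix, which should be treated as authoritative. The notation is assumed for illustration: $p_n$ and $p_N$ are the true prevalences of component $n$ and of the panel, and $AP_n$ and $AP_N$ are the corresponding expected apparent prevalences.

```latex
% Sketch under the assumption of independent components; notation is illustrative.
\begin{align*}
AP_n &= p_n\,\mathrm{sens}_n + (1 - p_n)(1 - \mathrm{spec}_n)
  && \text{expected component positivity} \\
\hat{p}_n &= \frac{AP_n + \mathrm{spec}_n - 1}{\mathrm{sens}_n + \mathrm{spec}_n - 1}
  && \text{Rogan-Gladen estimator} \\
p_N &= 1 - \prod_{n=1}^{N} (1 - p_n), \qquad
AP_N = 1 - \prod_{n=1}^{N} (1 - AP_n)
  && \text{any component positive} \\
\mathrm{spec}_N &= \prod_{n=1}^{N} \mathrm{spec}_n
  && \text{all components true negative} \\
\mathrm{sens}_N &= 1 - \frac{\prod_{n=1}^{N} (1 - AP_n) - \prod_{n=1}^{N} \mathrm{spec}_n (1 - p_n)}
  {1 - \prod_{n=1}^{N} (1 - p_n)}
  && \text{panel negative given disease present}
\end{align*}
```

Substituting the last two lines into $AP_N = p_N\,\mathrm{sens}_N + (1 - p_N)(1 - \mathrm{spec}_N)$ recovers $AP_N = 1 - \prod_n (1 - AP_n)$, so the sketch is internally consistent. It also makes explicit that the panel sensitivity depends on the component prevalences, while the panel specificity does not.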
Application to realistic situations
To illustrate the implications of multiplex test error for epidemiological studies, we have constructed a simulation based on pneumococcal serotypes, to demonstrate uncertainty and risk of bias that could occur in studies that investigate the overall burden of pneumococcal disease using multiplex testing.
We previously published the frequency of the 20 pneumococcal serotypes contained in the 20-valent pneumococcal conjugate vaccine (PCV20) that were identified in an invasive pneumococcal disease (IPD) cohort in Bristol between January 2021 and December 2022 [15]. This IPD distribution was scaled to give a realistic distribution of 20 subtypes in a hypothetical population with an overall PCV20-type pneumococcal prevalence of 10%. We simulate testing this population with a hypothetical multiplex panel which detects the 20 individual serotypes. For illustration purposes, we assume all component tests of the multiplex panel are moderately sensitive (80%) and highly specific (99.75%); these assumptions are loosely based on existing serotype specific detection tests. The simulated test results for individual serotypes were aggregated into a PCV7 group (any positive of serotypes 4, 6B, 9V, 14, 18C, 19F, 23F), a PCV13 group (PCV7 group plus 1, 3, 5, 6A, 7F, 19A), a PCV15 group (PCV13 plus 22F and 33F), and a PCV20 group (all serotypes). This allows us to compare “true” simulation prevalence to test positivity rates (apparent prevalence). Using the estimators for panel sensitivity and specificity above, we use the synthetic data set to estimate the true prevalence from test positivity, of both components and panels. With the same basic simulation we vary component test sensitivity and specificity, and investigate how the difference between “true” simulation prevalence (10%) and simulated test positivity rates (apparent prevalence) depends on test performance in a realistic scenario.
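A simulation of this general shape can be sketched in a few lines. The example below is an illustrative stand-in rather than the simulation used in this paper: the component weights are hypothetical (they are not the published IPD distribution), components are assumed independent, and the adjustment applies a truncated Rogan-Gladen estimate to each component before combining. All function names are assumptions.

```python
import random


def simulate_panel(prevalences, sensitivity, specificity, n_patients, seed=1):
    """Simulate true status and test results for independent components.

    Returns (true panel prevalence, panel test positivity, component positivity).
    """
    rng = random.Random(seed)
    component_positives = [0] * len(prevalences)
    panel_true = 0
    panel_observed = 0
    for _ in range(n_patients):
        any_true = False
        any_observed = False
        for n, p in enumerate(prevalences):
            diseased = rng.random() < p
            if diseased:
                positive = rng.random() < sensitivity    # true positive
            else:
                positive = rng.random() >= specificity   # false positive
            component_positives[n] += positive
            any_true = any_true or diseased
            any_observed = any_observed or positive
        panel_true += any_true
        panel_observed += any_observed
    component_ap = [count / n_patients for count in component_positives]
    return panel_true / n_patients, panel_observed / n_patients, component_ap


def rogan_gladen(ap, sensitivity, specificity):
    """Rogan-Gladen estimate of true prevalence, truncated to [0, 1]."""
    estimate = (ap + specificity - 1.0) / (sensitivity + specificity - 1.0)
    return min(1.0, max(0.0, estimate))


def panel_prevalence_estimate(component_ap, sensitivity, specificity):
    """Combine component estimates: the panel is negative only if all are."""
    prob_no_subtype = 1.0
    for ap in component_ap:
        prob_no_subtype *= 1.0 - rogan_gladen(ap, sensitivity, specificity)
    return 1.0 - prob_no_subtype


# Hypothetical relative frequencies for 20 components (NOT the published data).
weights = [25, 15, 10, 8, 7, 6, 5, 5, 4, 3, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0]
prevalences = [0.10 * w / sum(weights) for w in weights]

truth, positivity, component_ap = simulate_panel(
    prevalences, sensitivity=0.80, specificity=0.9975, n_patients=20000)
adjusted = panel_prevalence_estimate(component_ap, 0.80, 0.9975)
print(f"true prevalence {truth:.3f}, test positivity {positivity:.3f}, "
      f"adjusted estimate {adjusted:.3f}")
```

Because the hypothetical component prevalences sum to 10% but may co-occur, the simulated true panel prevalence is slightly below 10%. Truncating component estimates at zero introduces a small upward bias for absent subtypes, which the fuller methods in S2 Appendix are designed to handle.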
Uncertainty propagation
Our mathematical analysis assumes precisely known values for the specificity and sensitivity of component tests. However, these quantities can only be estimated as a result of control-group testing. Because individual subtypes are usually present at low levels when there are multiple subtypes, the number of positive disease controls for any given subtype is typically small [2]. This places a limit on the precision of estimates of component test sensitivity, which in turn makes interpretation of test positivity in both components and panels challenging.
For single tests, there are approaches to estimating true prevalence from test positivity which incorporate uncertainty in sensitivity and specificity, in both frequentist [16–18] and Bayesian frameworks [18–20]. In S2 Appendix we extend these two frameworks to account for multiplex testing, and implement a third approach, a resampling procedure combined with the Rogan-Gladen estimator, to propagate uncertainty. We test this against a synthetic data set that is based on the IPD distribution scaled to an overall pneumococcal prevalence of 10% (further described in S2 Appendix). These methods are implemented as an R package “testerror” in S3 R package.
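One plausible form of a resampling approach is sketched below. It is not the implementation in the “testerror” package: it assumes that sensitivity, specificity and component positivity can each be represented by a Beta distribution built from counts with a uniform prior, and that components are independent. All counts in the usage example are hypothetical.

```python
import random


def resample_panel_prevalence(positives, n_patients, sens_controls, spec_controls,
                              n_draws=2000, seed=1):
    """Resampled panel prevalence: returns (2.5%, 50%, 97.5%) quantiles.

    positives:     observed positive count for each component
    sens_controls: (detected, total) among disease-positive controls, per component
    spec_controls: (negative, total) among disease-free controls, per component
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        prob_no_subtype = 1.0
        for pos, (tp, n_pos), (tn, n_neg) in zip(positives, sens_controls, spec_controls):
            # Assumed Beta distributions from counts with a uniform prior.
            sens = rng.betavariate(tp + 1, n_pos - tp + 1)
            spec = rng.betavariate(tn + 1, n_neg - tn + 1)
            ap = rng.betavariate(pos + 1, n_patients - pos + 1)
            denominator = sens + spec - 1.0
            # Rogan-Gladen estimate, truncated to [0, 1].
            p = (ap + spec - 1.0) / denominator if denominator > 0 else 0.0
            prob_no_subtype *= 1.0 - min(1.0, max(0.0, p))
        draws.append(1.0 - prob_no_subtype)
    draws.sort()
    return draws[int(0.025 * n_draws)], draws[n_draws // 2], draws[int(0.975 * n_draws)]


# Hypothetical data: 20 components tested in 1000 patients, with 30 positive
# controls and 800 disease-free controls available for each component.
positives = [22, 14, 10, 9, 8, 7, 6, 6, 5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 3, 2]
sens_controls = [(24, 30)] * 20
spec_controls = [(798, 800)] * 20

lower, median, upper = resample_panel_prevalence(
    positives, 1000, sens_controls, spec_controls)
print(f"panel prevalence: {median:.3f} (95% interval {lower:.3f} to {upper:.3f})")
```

With only 30 positive controls per component, much of the width of the resulting interval comes from uncertainty in sensitivity rather than from the number of patients tested.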
Results
In the illustrative simulation motivated by IPD serotype distributions, the serotypes range from having no observed cases to making up 25.6% of the total [15]. When this is scaled to a synthetic population with 10% overall prevalence, the component prevalence ranges from 0% to 3.8% and, as with the theoretical examples in Fig 2 D-I, the majority of serotypes fall into the category where the apparent prevalence is higher than the true prevalence due to false positives, despite assuming a highly specific test with 99.75% specificity (Fig 3 A). The bias towards overestimation due to false positives is strongest for subtypes with low, or zero, prevalence, whereas the underestimation due to low sensitivity is strongest for subtypes with higher prevalence (also demonstrated in Fig 2 A-C).
In the synthetic but realistic scenario in Fig 3 A, with excellent test specificity (99.75%) and moderate test sensitivity (80%), the test positivity rate (apparent prevalence) of a component is expected to be higher than its true prevalence whenever the true prevalence is below a threshold of approximately 1.2%. When a set of 20 components are combined, that together result in a true panel prevalence of 10%, the combined errors mean that the panel test positivity is higher than the true prevalence (Fig 3 B, dashed black lines). In Fig 1 D and S1 Appendix we identify that false positives in one test balance out false negatives in another test, and this makes panel test sensitivity a complex quantity that counter-intuitively depends on disease prevalence, component distribution, sensitivity and specificity. As a result, the relationship between true panel prevalence and apparent panel prevalence (test positivity) is non-linear (Fig 3 B), and in this particular simulation, test positivity will be an over-estimate of true prevalence until true prevalence exceeds 22%.
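The component threshold follows from setting expected test positivity equal to true prevalence and solving for prevalence. The sketch below computes it, and also computes the panel prevalence at which 20 components would all sit exactly at that threshold. The second calculation assumes equal, independent component prevalences, which is a simplification of the IPD-based distribution used in this paper, so agreement with the panel figure quoted above should be read as indicative only.

```python
def crossover_prevalence(sensitivity, specificity):
    """Prevalence at which expected test positivity equals true prevalence.

    Solves p * sens + (1 - p) * (1 - spec) = p for p.
    """
    return (1.0 - specificity) / (2.0 - sensitivity - specificity)


def panel_crossover(sensitivity, specificity, n_components):
    """Panel prevalence when every component sits at its crossover.

    Assumes equal, independent component prevalences (illustrative only).
    """
    p_n = crossover_prevalence(sensitivity, specificity)
    return 1.0 - (1.0 - p_n) ** n_components


component_threshold = crossover_prevalence(0.80, 0.9975)
panel_threshold = panel_crossover(0.80, 0.9975, 20)
print(f"component crossover: {component_threshold:.2%}")
print(f"panel crossover (equal components): {panel_threshold:.1%}")
```

Below the crossover, false positives dominate and positivity overestimates prevalence; above it, false negatives dominate.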
Component sensitivity and specificity determine the difference between true and apparent prevalence as shown in Fig 4. This considers the same scenario of 10% prevalence, but shows the relative difference between true and apparent prevalence when varying sensitivity and specificity. The previous assumptions are marked as a blue cross in the figure, and at this high level of specificity (i.e. 99.75% - right dotted vertical line in Fig 4) the ratio between apparent and true prevalence is mostly influenced by test sensitivity. If sensitivity is low enough (less than 50%) the false negative rate exceeds the combined false positive rate and apparent prevalence is smaller than true prevalence. In any situation where the specificity is lower, the balance of error is most influenced by test specificity, and test sensitivity becomes much less important as a factor determining the difference between true and apparent prevalence. Even marginally lower values of test specificity result in test positivity being a gross overestimate of panel prevalence. If the component test specificity is only 98% (left dotted line) the combined 2% false positive rate of 20 components is sufficient to drive the overall panel test positivity to 4 times the level of the true prevalence set in this simulation, regardless of the test sensitivity.
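The pattern in Fig 4 can be approximated with a small calculation. As a sketch, assuming 20 independent components with equal prevalence (a simplification of the serotype distribution used in this paper), the expected ratio of apparent to true panel prevalence can be tabulated over a few sensitivity and specificity values; the function name and the grid are illustrative assumptions.

```python
def expected_panel_positivity(panel_prevalence, n_components, sensitivity, specificity):
    """Expected panel test positivity for equal, independent components."""
    # Component prevalence that yields the requested panel prevalence.
    p_n = 1.0 - (1.0 - panel_prevalence) ** (1.0 / n_components)
    # Expected component positivity: true positives plus false positives.
    ap_n = p_n * sensitivity + (1.0 - p_n) * (1.0 - specificity)
    # The panel is positive if any component is positive.
    return 1.0 - (1.0 - ap_n) ** n_components


true_prevalence = 0.10
for specificity in (0.9975, 0.98):
    for sensitivity in (0.45, 0.80, 0.95):
        ratio = expected_panel_positivity(
            true_prevalence, 20, sensitivity, specificity) / true_prevalence
        print(f"specificity {specificity:.2%}, sensitivity {sensitivity:.0%}: "
              f"apparent / true prevalence = {ratio:.2f}")
```

Under these assumptions the ratio at 99.75% specificity moves from below one to above one as sensitivity increases, whereas at 98% specificity it stays close to four across the whole range of sensitivity.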
We have described that even low false positive rates in component tests lead to overestimates of uncommon components. The converse is true for components with comparatively high prevalence. In the scenario we have been using as an example, despite the excellent specificity of the tests and 10% overall prevalence the balance of the component estimates is such that test positivity will overestimate true prevalence. This is seen more clearly in Fig 5 (left subfigure) in which simulated true prevalence levels (blue) are lower than test positivity (red) for all but two of the components (serotypes 3 and 8). In the right subfigure we see the effect of combining these into groups of 7, 13, 15 and 20 components, representing combinations of serotypes targeted by vaccines. As predicted, overestimates of prevalence are compounded and the size of each overestimate depends both on the number and distribution of test components.
In S2 Appendix we describe methods for correcting this bias in both frequentist and Bayesian frameworks using results from the mathematical analysis (S1 Appendix). In Fig 5 the Bayesian correction is applied and we are able to correctly predict the true prevalence (blue) allowing for uncertainty in our knowledge of test sensitivity and specificity. This is examined in a broader range of scenarios in S2 Appendix but in summary both Bayesian and Lang-Reiczigel (frequentist) approaches work well when we have good prior information about test sensitivity and specificity, but if these assumptions are very wrong, then we cannot expect either method to produce accurate estimates.
Discussion
Combining multiplex test results into a panel commonly results in test positivity that significantly overestimates true prevalence. Multiplex testing simultaneously tests many hypotheses, and combining the results into a single panel result leads to compounding of error. This error can be significant because of the low positive predictive value of individual component tests operating at low pre-test probability. This is critically dependent on component test specificity, and very high specificity is essential in tests which are designed to be interpreted as a combined result.
Panel test sensitivity is difficult to characterise. When multiplex tests are combined, components with a larger pre-test probability will generate more false negatives. In panel tests, false negative results in one component are over-ridden by any positives in other components. The sensitivity of the overall panel test is therefore a complicated function of component test sensitivity, specificity and pre-test probability (component prevalence), leading to higher panel sensitivity at higher prevalence. This is counter-intuitive as test sensitivity is usually regarded as independent of prevalence. This makes it challenging to compare panel test positivity rates in populations with different prevalence.
It remains possible to estimate true prevalence from test positivity, despite the complexities around panel test specificity and sensitivity. Positivity estimates generated by panel tests can be significantly biased, and test positivity is not a simple binomially distributed quantity (as demonstrated in Fig 2), so we cannot infer confidence intervals for prevalence directly from an observed positivity rate. The raw test positivity (apparent prevalence) of a panel test is therefore very hard to interpret. We recommend use of the techniques described in this paper to produce modelled true prevalence estimates with confidence limits.
Sensitivity and specificity assumptions that incorporate uncertainty are critical in producing accurate modelled true prevalence estimates. Specificity estimates for multiplex testing usually rely on a disease free control group, which may also be used to determine cut points to achieve set specificity levels, and can usually give us a reasonable estimate of component test specificity. Determining the sensitivity of the components of a multiplex test is much harder as it needs proven cases of disease with known subtype. These are difficult to find for rare disease subtypes, and gold standard identification of disease subtypes is not always available, or free from error [21, 22]. This results in a great deal of uncertainty in estimates of component test sensitivity. In some situations panel test sensitivity is estimated directly, however as we saw above, panel test sensitivity is dependent on a range of factors including overall prevalence, and component distribution. Any direct estimates of panel sensitivity are not generalisable outside of the specific population tested. The methods presented here for modelling true prevalence from multiplex tests do allow for the uncertainty in sensitivity and specificity to be propagated appropriately. The accuracy of this correction, however, is dependent on the quality of the estimates of specificity and sensitivity (see S2 Appendix), and complete mis-specification of either quantity prevents correct estimation of true prevalence. To improve accuracy and narrow the confidence intervals of estimates of prevalence it is far more important to characterise the sensitivity and specificity of the test than increase the sample size of testing. With a poorly understood test it is hard to draw any conclusions from the results.
The bias in panel test positivity is an inevitable consequence of combining multiple tests in environments with moderate to low prevalence. It can be mitigated in a number of ways: a) the specificity of the component tests is increased, b) second line confirmatory testing is performed, or c) the multiplex test is only applied to populations with a very high overall disease prevalence. In the last case we may be able to use a multiplex test to determine which subtype of disease is causative if we already know the patient has the disease by using a different test, or using specific clinical diagnostic criteria that select patients with high probability of disease.
There are analogous situations where multiplex panel tests are used with similar potential risks. For example the Biofire FilmArray™ respiratory panel 2.1 is one of a number of multiplex panels directed at respiratory pathogens [1]. It detects 19 viruses [21, 23]. We have trialled using this in Bristol to investigate co-infection of respiratory pathogens. There are multiple comparative evaluations of the Biofire FilmArray™ panel [7, 21, 22, 24–26] but there has not yet been a large scale evaluation of test specificity using disease free controls for each individual panel. Identifying a patient as having co-infection by any of the 19 viral diseases in the panel requires similar adjustment for the combined test uncertainty of all of the panel components to estimate co-infection frequency.
Conclusion
In this paper we have characterised the degree of uncertainty that results if multiplex panel test results are combined to give an overall result. The principal example of this is pneumococcal disease, in which specific component tests of a urine antigen detection test (UAD) identify up to 24 individual pneumococcal serotypes [2, 4]. The UAD is designed to be highly specific, with individual serotype test specificity of around 99.75%. The serotypes are generally grouped together by the vaccines that target them, to determine vaccine preventable disease, or all together as an estimate of pneumococcal disease burden [15]. This use of multiplex UAD testing is susceptible to the uncertainty and biases described in this analysis. Even considering the highly specific nature of the UAD tests [4], as the number of components increases so does the risk of bias. Any seemingly minor decrease in test specificity is expected to have a large impact on estimates of disease burden. Despite excellent specificity, without correction, the large number of tests in the panel creates uncertainty in prevalence estimates using UAD tests, and difficulty in comparing results to those of other similar studies. In this analysis we present methods to correct and quantify uncertainty in prevalence estimates using multiplex panels such as the UAD. These methods are a useful tool but critically rely on estimates of test sensitivity and specificity, and without these it is very hard to estimate disease burden using UAD results.
Uncertainty in test results due to lower sensitivity and specificity results in more noise at lower levels of prevalence [27, 28]. In vaccine effectiveness studies using a test negative design this phenomenon acts to mask the effect of a vaccine in the lower prevalence vaccinated group. Hence test error always results in an underestimate of vaccine effectiveness [28]. The less sensitive the test, the greater this underestimate. For pneumococcal vaccination, the serotype of pneumococcal disease is determined using UAD test panels [2, 4]. Theory suggests that, because of the issues identified here, conclusions on vaccine effectiveness based on the UAD tests are an underestimate [28]. The underestimate of vaccine effectiveness helps mitigate any bias resulting from test error in disease burden estimates, and hence the anticipated impact of a vaccine in the real world may be relatively unaffected. Further work would be needed to formally assess this.
Data Availability
All data and code produced are available online at https://github.com/bristol-vaccine-centre/testerror
Supporting information
S1 Appendix. Sensitivity and specificity of combined panel tests. Derivation of the performance metrics and true prevalence adjustments for combination tests.
S2 Appendix. Propagation of uncertainty of combined panel tests. Bayesian and frequentist approaches to estimating the uncertainty of panel test results.
S3 R package. testerror: Uncertainty in Multiplex Panel Testing. R package providing methods to support the estimation of epidemiological parameters based on the results of multiplex panel tests, doi:10.5281/zenodo.7691196.
Funding
We would like to acknowledge the help and support of the JUNIPER partnership (MRC grant no MR/X018598/1), with which RC and LD are affiliated. KTA gratefully acknowledges the financial support of the Engineering and Physical Sciences Research Council (EPSRC) via grant EP/T017856/1. CH was funded by the National Institute for Health Research (NIHR) via an Academic Clinical Fellowship (ACF-2015-25-002). The views expressed are those of the authors. Funding for the AvonCAP study was provided by Pfizer; however, the manuscript development and the analysis that is the subject of this manuscript were conducted independently of Pfizer.
Declarations
CH is Principal Investigator of the AvonCAP study which is an investigator-led University of Bristol study funded by Pfizer. AF is a member of the Joint Committee on Vaccination and Immunization (JCVI). He receives research funding from Pfizer as Chief Investigator of the AvonCAP study and he leads another project investigating transmission of respiratory bacteria in families jointly funded by Pfizer and the Gates Foundation. RC, AC, GQ, GO, RK, and LD receive research funding from Pfizer via the AvonCAP study.
Contributions
RC and LD generated the research questions. RC, KT, LD performed the mathematical analysis and simulations, and RC created the supporting software package. LD and AF provided oversight of the research. All authors contributed to the preparation of the manuscript and its revision for publication and had responsibility for the decision to publish.