Abstract
There is considerable interest in developing computational models capable of detecting rare disease patients in population-scale databases such as electronic health records (EHRs). Deriving these models is challenging for several reasons, perhaps the most daunting being the limited number of already-diagnosed, ‘labeled’ patients from which to learn. We overcome this obstacle with a novel lightly supervised algorithm that leverages unlabeled and/or unreliably labeled patient data, which is typically plentiful, to facilitate model induction. Importantly, we prove the algorithm is safe: adding unlabeled or unreliably labeled data to the learning procedure produces models that are usually more accurate, and guaranteed never to be less accurate, than models learned from reliably labeled data alone. The proposed method is shown to substantially outperform state-of-the-art models in patient-finding experiments involving two different rare diseases and a country-scale EHR database. Additionally, we demonstrate the feasibility of transforming high-performance models generated through light supervision into simpler models that, while still accurate, are readily interpretable by non-experts.
Competing Interest Statement
The authors are shareholders in Volv Global.
Funding Statement
Volv Global provided support for the research and studies reported in the submitted manuscript.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The use of the PHARMO data is controlled by the independent Compliance Committee STIZON/PHARMO Institute. All decisions of the Compliance Committee STIZON/PHARMO Institute are based on the applicable legislation in the Netherlands, e.g. the Personal Data Protection Act and the Medical Treatment Contract Act. Within this legal framework, the Code of Conduct 'Use of Data in Health Research' is an important document for the interpretation of the use of this kind of data for scientific research in the Netherlands, and is approved by the Dutch Data Protection Authority.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
The electronic health record data used in this work is protected by Dutch Privacy Regulations and cannot be made publicly available.