PT - JOURNAL ARTICLE
AU - Jonathan Lu
AU - Amelia Sattler
AU - Samantha Wang
AU - Ali Raza Khaki
AU - Alison Callahan
AU - Scott Fleming
AU - Rebecca Fong
AU - Benjamin Ehlert
AU - Ron C. Li
AU - Lisa Shieh
AU - Kavitha Ramchandran
AU - Michael F. Gensheimer
AU - Sarah Chobot
AU - Stephen Pfohl
AU - Siyun Li
AU - Kenny Shum
AU - Nitin Parikh
AU - Priya Desai
AU - Briththa Seevaratnam
AU - Melanie Hanson
AU - Margaret Smith
AU - Yizhe Xu
AU - Arjun Gokhale
AU - Steven Lin
AU - Michael A. Pfeffer
AU - Winifred Teuteberg
AU - Nigam H. Shah
TI - Considerations in the Reliability and Fairness Audits of Predictive Models for Advance Care Planning
AID - 10.1101/2022.07.10.22275967
DP - 2022 Jan 01
TA - medRxiv
PG - 2022.07.10.22275967
4099 - http://medrxiv.org/content/early/2022/07/12/2022.07.10.22275967.short
4100 - http://medrxiv.org/content/early/2022/07/12/2022.07.10.22275967.full
AB - Multiple reporting guidelines for artificial intelligence (AI) models in healthcare recommend that models be audited for reliability and fairness. However, there is a gap in operational guidance for performing reliability and fairness audits in practice. Following guideline recommendations, we conducted a reliability audit of two models based on model performance and calibration as well as a fairness audit based on summary statistics, subgroup performance and subgroup calibration. We assessed the Epic End-of-Life (EOL) Index model and an internally developed Stanford Hospital Medicine (HM) Advance Care Planning (ACP) model in 3 practice settings: Primary Care, Inpatient Oncology and Hospital Medicine, using clinicians’ answers to the surprise question (“Would you be surprised if [patient X] passed away in [Y years]?”) as a surrogate outcome. For performance, the models had positive predictive value (PPV) at or above 0.76 in all settings.
In Hospital Medicine and Inpatient Oncology, the Stanford HM ACP model had higher sensitivity (0.69, 0.89 respectively) than the EOL model (0.20, 0.27), and better calibration (O/E 1.5, 1.7) than the EOL model (O/E 2.5, 3.0). The Epic EOL model flagged fewer patients (11%, 21% respectively) than the Stanford HM ACP model (38%, 75%). There were no differences in performance and calibration by sex. Both models had lower sensitivity in Hispanic/Latino male patients with Race listed as “Other.” Ten clinicians were surveyed after a presentation summarizing the audit. 10/10 reported that summary statistics, overall performance, and subgroup performance would affect their decision to use the model to guide care; 9/10 said the same for overall and subgroup calibration. The most commonly identified barriers to routinely conducting such reliability and fairness audits were poor demographic data quality and lack of data access. This audit required 115 person-hours across 8-10 months. Our recommendations for performing reliability and fairness audits include verifying data validity, analyzing model performance on intersectional subgroups, and collecting clinician-patient linkages as necessary for label generation by clinicians. Those responsible for AI models should require such audits before model deployment and mediate between model auditors and impacted stakeholders.
Contribution to the Field Statement
Artificial intelligence (AI) models developed from electronic health record (EHR) data can be biased and unreliable. Despite multiple guidelines to improve reporting of model fairness and reliability, adherence is difficult given the gap between what guidelines seek and the operational feasibility of such reporting. We try to bridge this gap by describing a reliability and fairness audit of AI models that were considered for use to support team-based advance care planning (ACP) in three practice settings: Primary Care, Inpatient Oncology, and Hospital Medicine.
We lay out the data gathering processes as well as the design of the reliability and fairness audit, and present results of the audit and decision maker survey. We discuss key lessons learned, how long the audit took to perform, requirements regarding stakeholder relationships and data access, and limitations of the data. Our work may support others in implementing routine reliability and fairness audits of models prior to deployment into a practice setting.
Competing Interest Statement
SP is currently employed by Google, with contributions to this work made while at Stanford. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Funding Statement
JL was funded by a Stanford University School of Medicine MedScholars grant. The study was supported by the Stanford Medicine Program for AI in Healthcare which is funded by a gift from Debra and Mark Leslie as well as the Department of Medicine and Stanford Healthcare. AS, MS and Steven Lin were funded by the Division of Primary Care and Population Health at Stanford University School of Medicine and received no external funding for this work. SW, ARK, AC, RCL, LS, KR, MFG, SC, KS, NP, PD, AG, and MAP did not receive any external funding for this work. SF was funded by a Stanford Graduate Fellowship and a National Defense Science and Engineering Graduate Fellowship. BE was funded by the T15 Training Grant from the National Library of Medicine. RF, BS, MH and WT were funded by the Serious Illness Care Program at Stanford University School of Medicine and received no external funding for this work. SP was funded by R01 HL144555 from the National Heart, Lung, and Blood Institute (NHLBI) and the Debra and Mark Leslie Fund for the Stanford Medicine Program for AI in Healthcare.
Siyun Li was funded by the Siebel Scholars program from the Thomas and Stacey Siebel Foundation and the Debra and Mark Leslie Fund for the Stanford Medicine Program for AI in Healthcare. YX was funded by R01 HL144555 from the National Heart, Lung, and Blood Institute (NHLBI).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: IRB 42078: Advanced Analytics to Improve Palliative Care Access and IRB 57916: Analyzing STARR OMOP clinical data for the learning health system of Stanford University gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes
The data sets containing summary statistics and model performance and calibration metrics for this study can be found in the following attached Zip folder: https://drive.google.com/file/d/1-VB-kuvK2dFy5evCZbS2qe6DkWXmISv_/view?usp=sharing
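
The abstract reports sensitivity, PPV, the share of patients flagged, and an observed-to-expected (O/E) calibration ratio, both overall and by subgroup. As a rough illustration of how such quantities are commonly computed, the following is a minimal Python sketch. It is not the authors' code: the function names `audit_metrics` and `audit_by_subgroup`, the 0.5 flagging threshold, and the toy records are all assumptions made for this example, and O/E is taken here as the number of observed positive labels divided by the sum of predicted probabilities, which may differ from the definition used in the study.

```python
from collections import defaultdict


def audit_metrics(y_true, y_prob, threshold=0.5):
    """Compute sensitivity, PPV, flag rate and O/E for one group.

    y_true: list of 0/1 labels (e.g. a surrogate outcome).
    y_prob: list of predicted probabilities from the model.
    threshold: hypothetical cut-off above which a patient is flagged.
    """
    flagged = [p >= threshold for p in y_prob]
    tp = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)
    fp = sum(1 for y, f in zip(y_true, flagged) if y == 0 and f)
    fn = sum(1 for y, f in zip(y_true, flagged) if y == 1 and not f)
    n = len(y_true)
    expected = sum(y_prob)  # expected number of events under the model
    return {
        # Undefined ratios are returned as None rather than guessed.
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "ppv": tp / (tp + fp) if (tp + fp) else None,
        "flag_rate": (tp + fp) / n if n else None,
        "o_over_e": sum(y_true) / expected if expected else None,
    }


def audit_by_subgroup(records, threshold=0.5):
    """Group (subgroup, label, probability) records and audit each group."""
    groups = defaultdict(list)
    for group, y, p in records:
        groups[group].append((y, p))
    return {
        group: audit_metrics(
            [y for y, _ in rows], [p for _, p in rows], threshold
        )
        for group, rows in groups.items()
    }


if __name__ == "__main__":
    # Invented toy records, not study data.
    toy = [("A", 1, 0.75), ("A", 0, 0.25), ("B", 1, 0.25), ("B", 1, 0.25)]
    for group, metrics in audit_by_subgroup(toy).items():
        print(group, metrics)
```

Under this definition, an O/E above 1 means more events were observed than the model predicted. Intersectional subgroups, as recommended in the abstract, could be handled by using a tuple such as (ethnicity, sex, race) as the subgroup key; small subgroups will give unstable or undefined estimates, which the sketch reports as `None`.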