Machine learning predicts liver cancer risk from routine clinical data: a large population-based multicentric study =================================================================================================================== * Jan Clusmann * Paul-Henry Koop * David Y. Zhang * Felix van Haag * Omar S. M. El Nahhas * Tobias Seibel * Laura Žigutytė * Apichat Kaewdech * Julien Calderaro * Frank Tacke * Tom Luedde * Daniel Truhn * Tony Bruns * Kai Markus Schneider * Jakob N. Kather * Carolin V. Schneider ## Abstract **Background and aims** Hepatocellular carcinoma (HCC) is a highly fatal tumor, for which early detection and risk stratification is crucial, yet remains challenging. We aimed to develop an interpretable machine-learning framework for HCC risk stratification based on routinely collected clinical data. **Methods** We leverage data obtained from over 900,000 individuals and 983 cases of HCC across two large-scale population-based cohorts: the UK Biobank study and the “All Of Us Research Program”. For all of these patients, clinical data from timepoints years before diagnosis of HCC was available. We integrate data modalities including demographics, electronic health records, lifestyle, routine blood tests, genomics and metabolomics to offer a unique, multi-modal perspective on HCC risk. **Results** Our random-forest-based model significantly outperforms all publicly available state-of-the-art risk-scores, with an AUROC of 0.88 both for internal and external test sets. We demonstrate robustness of our model across ethnic subgroups, a major advance over previous models with variable performance by ethnicity. Further, we perform extensive feature-importance analysis, showcasing our approach as an interpretable framework. We provide all model weights and an open-source web calculator to facili-tate further validation of our model. **Conclusion** Our study presents a robust and interpretable machine-learning framework for HCC risk stratification, which offers the potential to improve early detection and could ultimately reduce disease burden through targeted interventions. **Lay summary** Finding liver cancer early is crucial for successful treatment. Therefore, screening with abdominal ultra-sound can be performed. However, it is not clear who should receive ultrasound screening, as with the current standard of screening only patients with liver cirrhosis, a severe liver disease, many patients are diagnosed with liver cancer in late stages. Therefore, we trained a machine learning model, acting like many decision trees at the same time, to detect patients with high risk of liver cancer by looking at patterns of almost 1000 cases of liver cancer in a population of 900.000 individuals. In a separate set of patients, which the model has not seen during training, our model worked better than all available models. Additionally, we investigated 1. how the model comes to its prediction, 2. whether it works in males and females alike and 3. which data is most relevant for the model. Like this, our model can help sort patients into categories like “high-risk”, “medium-risk” and “low-risk”, via which screening strategies can then be decided, to help improve early detection of liver cancer. Keywords * Machine learning * hepatocellular carcinoma * risk stratification * early detection * big data * population-cohort * prediction * screening ## Introduction Hepatocellular carcinoma (HCC) is the fifth most common malignancy and third leading cause of cancer-associated death. Its incidence has tremendously increased in the last years, high-lighting HCC as a major public health concern, especially in structurally disadvantaged re-gions, locally and globally 1–3. Current screening protocols for HCC predominantly target patients already diagnosed with liver cirrhosis and rely heavily on resource-intensive imaging technologies, which are not uni-versally accessible. Despite screening, the majority of HCC cases are diagnosed only at late disease stages, drastically worsening prognosis 4. Additionally, these protocols fail to consider multifactorial risk factors such as lifestyle, pre-existing health conditions, blood tests or omics signatures, which could identify a broader at-risk population 5–8. This gap is particularly critical as the prevalence of metabolic dysfunction-associated steatotic liver disease (MASLD) and its related HCC cases continue to rise 9. This highlights the need for a screening strategy that is more affordable, more inclusive and more efficient at detecting patients with high risk of HCC. Imaging-based screening could then be specifically targeted at this target group. Integrating the vast array of patient data to assess disease risk represents a significant chal-lenge in modern medicine, particularly as screening is typically conducted by general practi-tioners who must possess a broad knowledge base across a wide range of diseases, each with its own set of unique risk factors. Machine learning (ML) algorithms present a promising solution to this challenge, especially for HCC, and could further reduce healthcare disparities, if developed cautiously 10. Several studies have shown the superiority of ML algorithms over traditional regression analyses or few-parameter based decision-support systems as imple-mented in current guidelines 1,11–17. However, a major criticism is the lack of generalizability due to small cohorts, lack of published model weights, a lack of external testing 18, as well as usage of highly curated, retrospective datasets that do not necessarily represent real-world scenarios. In response to these challenges, we developed a novel, non-invasive multimodal screening tool that integrates these diverse risk factors into a comprehensive risk stratification model. Our study uses the extensive data available from the UK Biobank (UKB) and the “All Of Us Research program” (AOU), both of which include information extensive questionnaires, elec-tronic health records, blood parameters and genomics for each over 500,000 participants, with ongoing follow-up since 2006 for UKB 19–22 and extensive retrospective documentation for AO “High”-threshold) or to rule-out for screening (when < “Low”-threshold). We show that integration of multimodal data into ML models outperforms previous approaches for risk stratification of HCC development. ## Methods ### UK Biobank Cohort The present investigation leverages data from the UKB, an expansive prospective cohort study initiated between 2006 and 2010. The UKB enrolled a diverse population of individuals aged 37-73 years, en-compassing a total of 502,411 participants, with 502,309 participants currently available due to removal of consent of 102 participants. Initial evaluations were comprehensive, incorporating an array of anthro-pometric measurements, biospecimen collection, and the administration of multiple structured ques-tionnaires. Ethical approval for the UKB study was granted by the Northwest Multi-Center Research Ethics Committee. Comprehensive methodological details, data accessibility, and acquisition protocols are publicly disclosed on the UKB’s official website ([http://www.ukbiobank.ac.uk](http://www.ukbiobank.ac.uk)). Participants were ac-tively encouraged to partake in an inaugural clinical assessment, which was subsequently followed by longitudinal monitoring. All enrollees provided informed consent for genotypic analyses and the longi-tudinal linkage of their data to medical records. Periodic follow-up evaluations have been instituted to monitor alterations in health status and lifestyle variables. ### Primary survey data Baseline characteristics were obtained through structured questionnaires. These self-reported measures included lifestyle factors such as alcohol consumption, smoking history, as well as medical history encompassing liver disease, diabetes, and obesity. Estimated values (Alcohol consumption per day; Pack years) were limited to the 99.9th percentile to minimize the influence of extreme outliers while preserving the overall distribution of the data. Physiologically plausible limits were set to further curate the dataset (Supplementary Table 3). Ethnicities in the “All Of Us Research Program” were inferred from questionnaires “Self-reported race” and “self-reported ethnicity”. ### Electronic health records Electronic health records of UKB, documented by diagnosis codes at baseline assessment, were merged with EHR through ICD-Mapping (Supplementary Table 8), available via UKB through “hesin” and “hesin_diag”. Patients with an EHR entry of HCC before or in the first year after their first UKB visit were excluded from analysis (n=36) (Figure 2a) to prevent training on data of undetected HCC cases. Records were supplemented with UK death registers, with C220 documented here being treated equal to encoding via EHR. Training was performed on this data, while evaluation was restricted to all patients whose diagnosis of HCC was confirmed by data from the UK National Cancer Registry (UKB Resource 115558). EHR for a subset of 136 ICD codes (Supplementary Table 2, 3), consisting of a set of chronic liver diseases or signs of portal hypertension, cardiovascular risk factors, and gastrointestinal cancers except hepatobiliary cancers were included after performance of phenome wide association studies (Figure 2d). To prevent leakage of future diagnosis to the training data, only diagnosis codes up until 5 years after initial UKB visit were collected as features. EHR of “All Of Us” are curated from a variety of sources and then harmonized using the Observational Medical Outcomes Partnership (OMOP) Com-mon Data Model to be stored in data dictionaries. Specific data relevant to our study were then extracted on their workbench platform using built-in cohort and dataset builders. Diagnostic instances of all rele-vant ICD codes for each individual were tabulated along with shifted dates, as any one individual may have the same diagnosis code recorded in their health record multiple times from multiple medical visits. ![Fig. 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/11/04/2024.11.03.24316662/F2.medium.gif) [Fig. 2:](http://medrxiv.org/content/early/2024/11/04/2024.11.03.24316662/F2) Fig. 2: UK Biobank encompasses representable risk factors of HCC **a** Histogram of first occurrences of HCC diagnosis in EHR in UKB. Cases discarded for analysis shown in dark gray (exclusion criteria see Methods). **b** Distribution of HCC cases in UK Biobank centers on a map of Great Britain, displayed by country association and mapping to training or validation cohort, mapped as number of cases per 500.000 **c** Prevalence of most common disease codes in EHR or self-reported for HCC cohort, patients-at-risk cohort and control group. Sorted by highest prevalence in the HCC group. **d** Ranking of etiologies of HCC cases (n=538) with total numbers + percentage. Bars indicate timepoints, with “First EHR” being the first-ever NHS-doc-umented hospital stay, “Prior to HCC” the last stay before HCC diagnosis and “0-2y after HCC” indicating first diagnosis in the first two years after HCC diagnosis. No LD = No Liver Disease, CLD = Chronic Liver disease (except Cirrhosis or Viral Hepatitis) **e** Manhattan plot of phenome-wide-associations for individuals with HCC (C22.0) vs propensity-matched control population (on age, sex, bmi) (with P values (−log10). Highlighted are as-sociations with p values ≤ 0.05 (corrected for multiple testing by Bonferroni-correction. **f** Associations of blood biomarkers with HCC in CLD, displayed as Odds ratios (OR) and 95% confidence intervals for a one-standard deviation increase in each biomarker on the natural log scale, adjusted for age, sex and BMI. Significance defined as false discovery rate (FDR) controlled p < 0.01 in a linear model, marked with * and full saturation. Error bars limited to Y-limits for better readability. Mapping to UKB field IDs see Supplementary Table 5. **g** Associations of single-nucleotide polymorphisms with HCC occurrence displayed as log2-transformed Odds ratios + 95 % confi-dence interval, split between heterozygous (het) and homozygous (hom) occurrence, with color indicating direction of effect (blue = reduced risk, red = increased risk), and significance (Bonferroni-corrected p < 0.05) indicated by opacity, transparent=not-significant, opaque = significant. ### Blood count and biochemistry Blood assays in UKB (Category 100080) were obtained at baseline assessment, with biological pro-cessing explained in detail on the UKB’s official website. Extracted features listed in Supplementary Table 1, 2, 3, 8. NaN-values were mean-imputed for all features except Oestradiol and Rheumatoid factor, which were excluded from analysis due to lack of data (Oestradiol 402.879 NAs; Rheumatoid factor 437.202 NAs). Cleaned features were min-max normalized (see “Data Processing”). Elevated liver enzymes were defined as follows, with sex-specific cut-offs for males/females, respectively: aspar-tate aminotransferase (50/35 U/L), alanine aminotransferase (50/35 U/L) and Gamma glutamyltransfer-ase (60/40 U/L). Blood assays in “All Of us” are not assessed at a pre-defined baseline, but passed for every hospital visit of the participants. Values were harmonized to standardize units and then filtered to remove measurements made in an emergency or urgent care setting. Outliers more than four standard deviations from the mean were removed to filter out likely erroneous values, and then averaged over multiple instances to reduce the amount of missing data points. ### Genetic Data Genetic profiling was conducted on a subset of 488,377 individuals within the UKB cohort, specifically targeting carriers of established hepatic risk variants after comprehensive literature research (Figure 2 f, Supplementary Table 7). Genotypes were downloaded using the UKB gfetch utility. Specific genotype calls for hepatic risk variants were extracted with PLINK, including alignment and quality control. For each single nucleotide polymorphism (SNP), a linear regression model corrected on age, sex and the first five principal components of the genotypes was performed. Our analysis stratified participants into non-carriers, heterozygous carriers, and homozygous carriers of the minor allele (refer to Supplemen-tary Table 7 for details). ### Metabolomics Utilizing nuclear magnetic resonance (NMR) spectroscopy, the UKB successfully characterized the metabolomic/lipidomic signatures of 248,266 participants within the cohort (Category 220). This high-throughput approach enabled the comprehensive quantification of a wide array of metabolites (n = 143), encompassing lipids, amino acids, and small-molecule intermediates, thereby providing an intricate bi-ochemical snapshot pertinent to overall metabolic health. We extracted the 149 directly measured metabolomic features (see Supplementary Table 6), imputation and normalization procedures were performed in analogy to blood assays. All modeling analyses that included metabolomics data were performed only for the 248,266 participants for which metabolomics data was available. ### Data preprocessing UKB Data extraction and preprocessing steps were performed with RStudio (Version 2023.12.1) and are publicly available in our GitHub repository (see “Code Availability”). Summary tables were created with the gtsummary package 65. Extracted data points are listed in Supplementary Tables. Data was pro-cessed separately for the following entities: Initial assessment (Demographics, Lifestyle, Self-reported diagnosis codes), Electronic Health Records, Blood/Serum, Single Nucleotide Polymorphisms, Metab-olomics. Data was treated as independent factor level (categorical data) or imputed via mean imputation of continuous data. UKB Patients were stratified into six folds according to the assessment center (ID 54-0.0), with folds 1 to 5 as the basis for the five-fold cross-validation, and fold 6 as independent test set (Figure 2b, Supplementary Table 10). Outliers were limited to pre-defined physiological limits to allow for shared normalization between UKB and AOU to be applied (Supplementary Table 12). ### The All Of Us cohort The All Of Us cohort is a longitudinal cohort with 409.420 participants as of 10/2024. The program collects and curates clinical, biological and environmental determinants of health and disease, with procedures described in detail elsewhere 24. Health data are obtained through electronic health records and participant surveys, which are available here: [www.researchallofus.org/data-tools/survey-explorer/](https://www.researchallofus.org/data-tools/survey-explorer/). AOU gathers EHR data from > 50 healthcare organizations, data stewards at provider organizations harmonize local data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, which can then be accessed by researchers. For details on biospecimen collection and pro-cessing, see [https://allofus.nih.gov/funding-and-program-partners/biobank](https://allofus.nih.gov/funding-and-program-partners/biobank). ### Data preprocessing AOU Preprocessing was performed in line with UKB preprocessing, with the following exceptions: Alcohol consumption was approximated by frequency and amount of drinks consumed with 14g Alco-hol/drink as average. While UKB invites for a primary assessment, All Of Us is linked to hospital stays, therefore more prone to missing values. Participants with over 20 NA values were excluded from further analysis. AOU data was then converted to UKB units as documented in Supplementary Table 12. ### Computational infrastructure Preprocessing and training of UKB data was performed on local workstations with Intel i7-i9 CPUs and 32 GB of RAM. Training times ranged between 10 and 30 minutes, without GPU requirement. Training times ranged from 10 minutes to a maximum of 2 hours for the biggest model. The cross-validations are parallelized to use all CPU-cores available to the user. ### Modeling and hyperparameter tuning The modeling was conducted using scikit-learn version 1.2.2 42. We chose the random forest classifier and extreme gradient boosting as default model architectures based on literature and practicability. We split the UKB-data into 2 sets, one for training incl. cross-validation with exclusively England-based UKB assessment centers and one for testing of the model performance (Glasgow, Edinburgh, Cardiff, Swansea, Newcastle). Training was carried out in a grouped five-fold cross-validation. Herein, each of the five folds (each consisting of 3-4 assessment centers, split see Supplementary Table 10-11), served for validation once and served for training in 4 out of 5 models. A grid search evaluating different hy-perparameter combinations in the five-fold cross-validation revealed an optimal architecture with 50 estimators and a max depth of 3. This architecture was afterwards used for the training of all models to facilitate the comparison based on the input data and reduce the effect of different architectures. All five models derived from cross-validation were integrated into a majority-voting model, as an ensemble-classifier, the final model. This final model was then applied on the withheld test set (Held-out fold), as well as an external test dataset from the “All Of Us Cohort” for detailed performance and bias estima-tions. ### Interpretability Feature importance was estimated as mean and standard deviation of accumulation of the impurity decrease within each tree via the scikit-learn library. SHapley Additive exPlanations (shap-library) were used for further interpretability with bootstrapping of individual participants feature importance, con-necting optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions 53,66. ### External Testing External testing was performed in the research workbench of the “All Of Us Research Cohort” by an independent research group, with only access to the scripts, one hot encoder and model object. In-detail scripts and requirements available via our GitHub repository. ### Statistical analysis Statistical comparisons between groups were performed using Welch’s two-sample t-tests for continu-ous variables and Chi-square tests for categorical variables. For categorical variables with low cell counts (<20), Chi-square approximation may be imprecise. Results are presented as mean ± standard deviation (SD) for continuous variables and absolute numbers (percentages) for categorical variables, with values < 20 masked for AOU to protect privacy as required by AOU. All statistical analyses were adjusted for multiple testing with false-discovery-rate or Bonferroni correction. For statistical comparison of AUROCs, two-sided DeLong tests were performed with Bonferroni post hoc correction. In depth doc-umentation for all statistical analysis can be found in our GitHub repository. Statistical tests were calcu-lated in R (Tables) or in Python (Models). ### Visualization Visuals were created with RStudio 2024.04.0 as well as Python 3.9 matplotlib, scikit-learn, seaborn (0.12.2), shap 53 with all additional requirements being accessible via our GitHub repository. Figure assembly was performed with Inkscape 1.3.2, with integration of icons under common license from Microsoft, Flaticons, Bioicons, Healthicons. ## Results ### HCC cases in UK Biobank represent the general population We hypothesized that population-based cohorts are well-suited to build HCC-prediction mod-els. Firstly, they have a superior generalizability to the overall population compared to cohorts from tertiary centers and secondly, they uniquely provide rich data on phenotypic features even years before diagnosis of HCC 19,22. A total of 538 eligible HCC cases were observed in UKB (Supplementary Figure 1, Figure 2a) and 445 HCC cases in AOU, with a mean time to HCC diagnosis of 8.7 years, mean age at diagnosis of 70 ± 6.8 years in UKB and 62.1 ± 10.2 years in AOU (Supplementary Fig. 2a, b). Incidence was comparable to overall HCC incidence in the UK with 6-10 / 100,000 new cases per year (Figure 2a) 25, while more frequent in the AOU cohort (15-30 / 100.00, Figure 5a). All 22 centers of data acquisition showed similar abundance of HCC (Figure 2b, Supplementary Table 11). 399 of 538 HCC cases had a posi-tive cancer-record in respective national cancer registries. Selection of 53 questionnaire cat-egories on general and lifestyle information was based on the extensively characterized liter-ature on known risk factors of HCC (Alcohol consumption, Smoking, Social status) 26,27 (Table 1, Supplementary Table 3). Selection of disease-codes for the prediction models was based on prior phenome-wide-association studies for HCC in UKB, controlled for age, sex, bmi and Townsend deprivation index. These revealed a multitude of significant correlations (Figure 2e, Supplementary Table 1, 2, 4, 8)), expectedly associating mostly liver diseases, but also alco-holism, diabetes and many more with risk of HCC. Notably, only 31 % of included HCC pa-tients had a diagnosis of cirrhosis, viral hepatitis or other chronic liver diseases prior to HCC diagnosis, while 41 % were diagnosed with a liver disease in the two years after HCC diagno-sis, and 28 % never received a diagnosis of liver disease via ICD10 (Figure 2 c, d, Supple-mentary Table 9, 13, 14). In addition to modeling the entire cohort (All), these pre-diagnosed patients were also investigated separately as a distinct “patients-at-risk” population (PAR). Next to previous diagnosis of steatotic liver diseases, viral hepatitis, or cirrhosis, the PAR population also encompassed cases with elevated liver enzymes at baseline examination (ICD-Codes in Supplementary Table 1, 2, 9). All blood count and serum parameters available diagnosis of HCC after controlling for age, sex, BMI and chronic liver disease (Figure 2f, Sup-plementary Table 5, 15, 16). Common single nucleotide polymorphisms known to increase risk of liver diseases such as MASLD, cirrhosis or HCC were included after confirmation via linear regression analysis on UKB data (Figure 2 g, Supplementary Table 7) 6,28–35. Nucleotide magnetic resonance spectroscopy metabolomics (NMR-Metabolomics) on 248,266 partici-pants was investigated for possible correlating effects after correction for age, sex, bmi (Sup-plementary Fig. 2c, Supplementary Table 6). Of 143 metabolomic features, we observed 109 significant associations with HCC (Supplementary Table 17). View this table: [Table 1:](http://medrxiv.org/content/early/2024/11/04/2024.11.03.24316662/T1) Table 1: Baseline characteristics of the UK Biobank study population stratified by sex and HCC status. Baseline characteristics of study participants (N = 502 309) stratified by sex and hepatocellular carcinoma (HCC) status. Data are presented as mean (± SD) for continuous variables and n (%) for categorical variables. P-values were calculated using Student’s t-test for continuous variables and chi-square test for categorical variables. All continuous variables were assessed at time of UK Biobank enrollment. Q-values represent Bonferroni-adjusted p-values to account for multiple testing. ### Stepwise study architecture mimics data availability in real-world settings We investigated predictive capacities of five different clinically relevant data modalities, namely demographics, electronic health records, bloodwork, genomics and metabolomics in two ways: first, for each modality individually, and second, using a stepwise approach reflect-ing the clinical availability of the modalities (Figure 1a). Predictive capacities were then as-sessed for both cohorts, All and PAR, resulting in a total of 20 model variations. Training and iterative hyperparameter tuning was performed in a five-fold cross-validation on data from only England-based centers of UKB, a total of 18 of 22 UKB centers. Testing of the final models was performed on untouched data of the 4 remaining centers, located either in Scotland or Wales (+ Newcastle) (n = 123 / 538) (Figure 1b, Supplementary Table 10, 11), as well as on the entire AOU cohort. Compared to more complex deep learning techniques, decision-tree based machine learning models like random forest classifiers (RFC) have the advantage of direct explainability due to feature importance metrics, while being more data-efficient than neural networks or transform-ers 36, 37. We therefore included the decision-tree based models RFC and Extreme Gradient Boosting (XGB) in initial benchmarking experiments (Supplementary Fig. 3 a, b). Both models performed similarly, however RFC performance was more consistent, especially for models trained on reduced sample sizes. As RFC is also dominant in literature reports, we selected the RFC architecture as our baseline architecture to increase comparability. A grid-search for hyperparameter optimization on the internal cross-validation set revealed slim architectures (e.g. *max-depth = 3*, *n_estimators = 50*) to be beneficial despite the large size of the dataset (see Methods section). ### Machine learning is superior to linear risk scores for prediction of HCC in the general population We hypothesized that accuracy of machine learning models would be superior to established liver-related risk assessment scores that are used to predict risk of HCC. We therefore per-formed a comparative analysis among available linear scores in the literature between aMAP 38, FIB-4 39, APRI 40 and NFS 41 scores for prediction of HCC (Supplementary Fig. 3 c, d). Herein, the aMAP-score achieved the best performance of all linear scores with an AUROC of 0.79. We therefore used the aMAP-score as our benchmark throughout model performance evaluation. Evaluation on the withheld UKB test-set (90 HCC cases / 93533 controls for “All” and 72 cases / 20958 controls for “PAR”) revealed comparable performance throughout Model A-E for All and PAR (Figure 3 b, e, Table 3, Supplementary Fig. 3 e, Supplementary Table 18, 19), with blood parameters consistently as the most relevant modality (AUC 0.86 and AUC 0.87 for All/PAR respectively), followed by demographics (0.80 / 0.78), metabolomics (0.79 / 0.77), diagnosis (0.74 / 0.73) and SNPs (0.62 / 0.55). The single-modality models based on blood, demographics, and metabolomics all outperformed literature scores. ![Fig. 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/11/04/2024.11.03.24316662/F3.medium.gif) [Fig. 3:](http://medrxiv.org/content/early/2024/11/04/2024.11.03.24316662/F3) Fig. 3: Prediction metrics **a, d** Split violin plots displaying distribution of predicted probabilities for true negative (TN, left half violin) and true positive (TP, right half violin) cases per trained modality in test set for all patients (a) or PAR patients (d) in UKB test set. Quartiles (25th, 50th, 75th) displayed with dotted lines. **c, d, e, f** Model performance on the test set as receiver operating characteristic (ROC) curves and corresponding area under the curve (AUC) for either separate models (b, e) or combined scenarios A-E (c, f). Each line represents the performance of one majority vote model from five-fold cross-validation. Benchmark score ’aMAP’ (AGE + Male + (Albumin – Bilirubin) + Platelets²) and random guess depicted by dotted lines. **g** Legend for contribution of modalities per incremental model **h, i** Precision-recall curves for Model A-E, as well as aMAP score, with magnified section for precision = 0-0.2. TP = True Posi-tives, FP = False Positives, FN= False Negatives. **j** Split violin distributions for prediction scores of models with reduced feature numbers (as in a/b) for UKB cohort. **k** ROC and indicated AUROC for reduced models in UKB, corresponding to j. **l** PRCs (as in h, i) for reduced models in UKB. Clinical routine is multimodal, and a single modality likely never captures the true complexity. We therefore hypothesized a) that combination of modalities could improve model perfor-mance substantially, and b) that based on results from separate models, less affordable omics methods might not increase model performance substantially. Incremental models indeed re-vealed overall superior performance compared to separate models, with a plateau at an AU-ROC of 0.88 with model C, a combination of demographics, diagnosis and blood data (Fig 3 c, f, g). AUROCs could not be further improved by adding genetic information, and only im-proved by metabolomics for the All cohort, though this could not be tested externally. Precision-recall curves (PRC) surprisingly revealed extremely low performance for the aMAP-score (Figure 3h, i) and other literature benchmarks (Supplementary Fig. 3f), with areas under the PRCs (AUPRC) of 0.00 to 0.01. The only exception for this was prediction of HCC based solely on the ICD-code Liver Cirrhosis (AUPRC of 0.14, Supplementary Fig. 3f), which proved superior AUPRCs also to ML models, while AUROCs for cirrhosis were worse (AUC 0.57 for UKB, 0.71 for AOU). For ML models, addition of blood data increased the AUPRC from negli-gible values to 0.06 (All) and 0.09 (PAR), with further increases both for D and E, although only for low-recall settings, i.e. very high risk cases. Meanwhile curves aligned completely for higher recall values, suggesting no additional contribution to overall discriminatory capabilities in a broader population. ### Explainability methods highlight known liver cancer risk factors To increase practicability, we next performed an ablation study to reduce the number of fea-tures needed for the model (Fig. 3j-l). We employed a two-step approach, focusing firstly on routinely assessed features and secondly on removal of features with low relevance to the model, for which we extracted the models’ feature importance 42. First, a team of clinicians curated a set of 75 exclusively routine clinical features. For these, we then gradually removed features with lowest feature importance (TOP 75 features, TOP 30, TOP 15). Interestingly, a significant reduction in performance was observed between TOP75 and TOP30 (p = 0.0012, two-sided DeLong-Test), whereas performance of TOP30 and TOP15 did not differ signifi-cantly (p = 1, two-sided DeLong-Test). Mirroring this, a performance reduction was observed in AUPRCs from TOP75 to TOP30, but not from TOP30 to TOP15 (Figure 3l, Supplementary Table 23, 24). To corroborate that the models’ strong performances were due to the selected features and not merely the RFC architecture, we trained a RFC using only variables included in the aMAP score (AMAP-RFC). This model outperformed the original aMAP-score, while being significantly worse than the models with broader feature representation, suggesting that both the architecture and the feature amount contribute to the strong performance of our mod-els. We hypothesized that features relevant to the ML classifier would reflect the well characterized risk landscape also represented in linear risk scores. We therefore investigated the contribu-tion weights of individual features for the model decision, and compared those per feature and per modality. We consistently observed the highest modality importance for blood-based fea-tures such as γ-GT, AST and ALT and platelets. Importantly however, feature importance was distributed more nuanced over the features, also incorporating measurements such as IGF-1, waist circumference and many more. In total, over 20 features contributed each with > 1 % (Figure 4 a, b, Supplementary Table 25). We further observed substantially different contribu-tions to model prediction per modality. Blood data consistently had the highest contribution, both when assessed as mean or sum of feature importance per modality. This was followed by demography and lifestyle, EHR, and lastly genomics with negligible relevance for the model. ![Fig. 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/11/04/2024.11.03.24316662/F4.medium.gif) [Fig. 4:](http://medrxiv.org/content/early/2024/11/04/2024.11.03.24316662/F4) Fig. 4: Interpretability of HCC risk prediction models **a, b** Feature importances of the top-50 features as well as average by feature group (see Supplementary Material for corresponding features and groups) per indicated model, as mean ± SD of accumulation of the impurity de-crease within each tree. **c, e** Split violin plots with displayed thresholding as low (green), middle (yellow) and high (red) risk groups, displaying distribution of predicted probabilities for true negatives (TN, left half violin) and true positive (TP, right half violin) cases per trained modality in test set for all patients (c) or PAR (e). Quartiles (25th, 50th, 75th) displayed with dotted lines. **d, f** Confusion matrices corresponding to predictions scores from c, e re-spectively, with indicated thresholds. Upper panels correspond to single thresholding, lower panels correspond to double thresholding. PPV = Positive predictive value. NPV = Negative predictive value. Coloring indicates relative percentage per class. **g**, **h** Kaplan-Meier curves for relative events per risk group over time. * indicates timeframes that were removed from analysis, either due to removal of events close to baseline examination or due to cut-off by 1st of January 2024. ### Thresholds create actionable classes for HCC risk estimation Continuous scores are more accurate than classifications. However, in a clinical context they are often impractical, as they do not allow for clear decisions. A compromise between contin-uous scores and binary classification is commonly performed by categorizing predictions on an ordinal risk scale, e.g. in low-, middle-and high-risk groups, with a higher “rule-in” and a lower “rule-out” threshold. Based on the distribution of prediction values in the five-fold cross-validation, we established a three-class system, and evaluated the two thresholds on the with-held UKB test set, both for UKB “All” (Figure 4c, d) and “PAR” (Figure 4 e, f). In both cohorts, > 65 % of HCC cases could hereby be classified in the high-risk group with a positive predictive value (PPV) of 1.2 % (“All”) and 5.1 % (“PAR”), negative predictive values (NPV) of 0.9999 (“All”) and 0.9995 (“PAR”). High-and medium-risk groups together accounted for 94.5 % (“All”) and 90.3 % (“PAR”) of HCC cases. Still, as a low-incidence disease, this performance is chal-lenged by a high number of false positive cases. We conducted a comprehensive comparison between the true positive (TP) and false negative (FN) groups, as well as true negative (TN) and false positive (FP) groups. Firstly, there was no significant difference between the time-to-cancer for TP (9.6 ± 3,7 years) compared to FN 9 ± 3,8 years). Notably, for UKB, we found a significantly higher incidence of FN among females compared to males (Table 3). Further-more, distinct patterns emerged among the false negative group, characterized by a lower prevalence of obesity, smoking, and alcohol consumption. ### Generalizable models for HCC risk stratification Finally, we hypothesized that our models would be generalizable to an independent popula-tion. We therefore used all participants of the “All Of Us Research Program” with at least one instance of available blood data and baseline demographic information (n = 126 896, 445 cases of HCC). Incidence of HCC (Figure 5a) and age distribution at first diagnosis (Supple-mentary Fig. 2b) was comparable to UKB data and to literature reports 43. Ethnicity distribution among controls and cases was much more diverse than in the UKB (Figure 5b, Table 2), and we observed a clear tendency towards better EHR representation among HCC cases, espe-cially for liver cirrhosis (Figure 5c, Table 2). After rigorous preprocessing, the models TOP75, TOP30, TOP15 and AMAP-RFC were applied to all AOU patients along linear risk scores. All ML models achieved high AUROCs, interestingly without performance loss for models with reduced features, with also AUPRCs comparable to AUPRCs from the UKB All test cohort (Figure 5, Supplementary Table 20, 21). Again, ML-models consistently outperformed linear risk models (Figure 5 d, e, f, Supplementary Fig. 3c, d, f), despite linear risk models performing slightly better in AOU than in the UKB cohort. While the relative distribution of prediction values for controls and cases was very similar to UKB, we observed statistically significant higher overall prediction values in AOU (0.38 ± 0.15 (AOU) vs 0.33 ± 0.13 (UKB) for controls, p < 0.0001, and 0.68 ± 0.19 (AOU) vs 0.62 ± 0.22 (UKB) for HCC cases, p = 0.008, for Model TOP15, see Supplementary Table 22), reflecting the overall differences in disease burden between the two populations. ![Fig. 5:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/11/04/2024.11.03.24316662/F5.medium.gif) [Fig. 5:](http://medrxiv.org/content/early/2024/11/04/2024.11.03.24316662/F5) Fig. 5: Machine learning models are generalizable to independent populations **a** Histogram of first occurrences of HCC diagnosis in EHR in All Of Us. b Prevalence of the most common disease codes for HCC group (red) and control group (gray), sorted by highest prevalence in HCC group. Arterial HT = Arterial Hypertension, Dis.= Disease, DM = Diabetes mellitus, HSM = Hepatosplenomegaly, AKI = Acute kidney injury, CKD = chronic kidney disease. **c** Ethnic distribution of all participants (outer circle) and HCC cases (inner circle) in All Of Us, corresponding to self-reported survey categories. **c** Ranking of etiologies of HCC cases (n=445) with total numbers + percentage of etiologies in All Of Us. **d** Split violin plots displaying distribution of predicted probabilities for true negative (TN, left half violin) and true positive (TP, right half violin) cases per trained modality in All Of Us. **e** Model performance on All Of us as ROC curves and corresponding AUCs. **f** Precision-recall curves corresponding to models from d, e. **g-l** Confusion matrices corresponding to prediction scores from d-f, split by high vs lower thresholds, female vs male sex or by ethnicity (white vs non-white). **m** Shapley feature importance analysis for Model TOP15, with each datapoint representing a single participant. AST = Aspartate aminotransferase, DM = Diabetes mellitus, ALT = Alanine aminotransferase, HT = hypertension, AP = Alkaline phosphatase. Distance from 0.0 on x-axis indicates feature importance, color indicates direction of feature. View this table: [Table 2:](http://medrxiv.org/content/early/2024/11/04/2024.11.03.24316662/T2) Table 2: Baseline characteristics of All of Us Research Program study population stratified by sex and HCC status. Baseline characteristics of study participants (N = 126 896) stratified by sex and hepatocellular carcinoma (HCC) status. Data are presented as mean (± SD) for continuous variables and n (%) for categorical variables. P-values were calculated using Student’s t-test for continuous variables and chi-square test for categorical variables. Fea-tures with less than 20 individual data points grouped together are masked (displayed as > 20) as is the requirement of the All Of Us Research Workbench. Q-values represent Bonferroni-adjusted p-values to account for multiple testing. Again, model performance was better for male than for female cases, although with less pro-nounced differences (Figure 5, Table 3). Interestingly, model performance for European vs non-European populations did not differ significantly and did especially not lean towards better performance for the European population, despite training on 94 % data from “Caucasian” ethnicity, representing the distribution in UKB (Figure 5, Supplementary Table 21). View this table: [Table 3:](http://medrxiv.org/content/early/2024/11/04/2024.11.03.24316662/T3) Table 3: Confusion matrix sub-analysis for both UKB and AOU cohort Baseline characteristics split by bins of confusion matrix (FN = False negatives, TP = True positives, FP = False positives, TN = True negatives) for both UKB and AOU cohorts at the “rule-in” threshold of 0.55, with > 0.55 = High Risk / positive class, < 0.55 = Low or middle risk / negative class. Categories for ethnicity were unified and simplified to allow for shared visualization. Features with less than 20 individual data points grouped together are masked (displayed as > 20) as is the requirement of the All Of Us Research Workbench. For logical features (True/False), only positive class is displayed. Percentages represent a fraction of population per column. Only categorical fea-tures were subjected to comparison of significant differences. q-values represent Bonferroni-adjusted p-values to account for multiple testing. ## Discussion Our study shows that simple, interpretable machine learning algorithms can accurately stratify patients according to their individual risk of developing HCC on a population scale. We present these models along with systematic evaluation of risk scores proposed in literature. Evaluation on two of the largest available cohorts worldwide shows superior performance of the devel-oped ML models compared to all publicly available risk scores and provide robust explainabil-ity for our algorithms. Current HCC screening algorithms incorporate cirrhosis, a highly specific, but insensitive risk factor for HCC, as the sole inclusion criterion for apparative screening. This is problematic as firstly, compensated cirrhosis can be occult and therefore not diagnosed in a large fraction of patients and secondly, the percentage of HCC in non-cirrhotic patients is steadily growing, widening the population at risk. This gap in screening facilitates late-stage diagnosis and high mortality of HCC. Non-invasive risk scores used as surrogates for liver fibrosis and predictors of liver-related mortality are usually developed in small, tertiary-hospital-based cohorts, lack accuracy, and do not represent the complex individual risk landscape ranging from diagnosis, lifestyle, demographics to bloodwork. Further, they are rarely validated on population-level. We hypothesized that population-based cohorts with longitudinal follow-up are the ideal foun-dation for generalizable risk stratification in HCC. This allowed for simulation of a prospective evaluation on two large, independent test sets, according to the “Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD)”, with firstly, in-dependent testing on all data from UKB centers in Scotland and Wales (Nonrandom split-sample development, TRIPOD 2b), and secondly, the entire “All Of Us” cohort (Separate test data set, TRIPOD 3) 44. To account for distinct pre-test probabilities, especially between primary and tertiary-care cen-ters, we evaluated all our models in a general population and for a subset of patients-at-risk, with pre-diagnosed liver disease or elevated liver enzymes at baseline. We show that, based on performance in two distinct populations, the aMAP-score is the best-performing publicly available linear risk score for HCC. However, despite a modest accuracy, negligible precision and recall lead to an extremely high number of false positive cases for HCC risk. The observed imbalance between metrics is presumed to be caused by the inherent properties of receiver-operating-characteristics. These can cause overestimation of model performance in imbal-anced data, which is the case in virtually every screening scenario for low-prevalence diseases 45 46,47. In contrast to aMAP, presence of cirrhosis prior to HCC was a highly specific, but in-sensitive predictor of HCC. This suggests that cirrhosis alone is a good “rule-in” discriminator for HCC-screening, but an insufficient “rule-out” discriminator in a broad population, while it is still applied here as a sole means of decision-support in various guidelines 7,48. Our study presents the first systematic comparisons for relevance of independent data modal-ities. We show that training machine learning models on standard blood values like transami-nases and platelets together with EHR data and lifestyle is sufficient for accurate risk stratifi-cation for HCC, in line with previous work for other cancer types and relevant variables in linear risk scores for HCC 49,50,38–40. Interestingly, genomic data did not increase model perfor-mance despite representable prevalences of known genetic risk factors like PNPLA3 6 (Figure 2g). In line with previous reports, this questions the relevance of polygenic risk scores (PRS) for risk prediction of HCC compared to phenotypic modeling 8, 51,52. Meanwhile, metabolomics did increase model performance to some extent, but they are not routinely measured, more expensive and eventually not necessary given the satisfactory performance of models based on routine clinical data alone. Investigation of model explainability highlights our models, especially the slimmer models TOP15 -TOP75, as clinically feasible and interpretable tools 53. We confirm liver enzymes, platelets, demographic features such as age and weight, but also arterial hypertension and liver cirrhosis as predictive features, clearly aligning with clinical expertise and literature find-ings 34,38,40,49,54. Still, certain features contributed less than expected to the model feature im-portance, indicating present biases and calling for cautious interpretation. E.g., the social de-sirability bias could explain systematic underestimation of the relevance of alcohol consump-tion in ML models, as representation of alcohol consumption has been reported to be incorrect 55. Further, training on populations with better representation of liver cirrhosis as ICD-code could have assigned higher feature importance to liver cirrhosis. Our study has several limitations. We did not benchmark our models against previously pub-lished ML models for HCC risk, most of which were developed in smaller hospital-based co-horts and lacked external testing 11–13,56. This was not possible, as none of the prior studies made their model architectures or weights publicly available, highlighting the need for a shift in thinking towards a more collaborative framework. The development of clear thresholds poses a further limitation to our study. The thresholds developed during five-fold cross-valida-tion on UKB training data achieved very high accuracy when validated on the UKB test set comprised of completely independent test centers from different countries. However, it is still biased by the same active recruitment method, resulting in the well characterized reduced disease burden in the overall UKB population 22. The All Of Us Research program, however, encompasses mostly hospital data. Therefore, while the model performance is not inferior in AOU vs UKB, the absolute prediction values are higher in AOU, challenging the proposed thresholds and reflecting well-characterized differences in disease burden 23. Most-likely, thresholds will have to be developed independently for certain populations, taking into account the respective disease burden, but also economic considerations that might differ depending on respective populations and healthcare systems. Cost-benefit analysis will have to be per-formed to estimate ideal proportions and screening intervals. Importantly, the number needed to screen (NNS) correlates inversely with the fraction of potential HCCs one will detect. Aiming at a detection rate of 2 out of 3 of cases of HCC in the screening program (reflecting our rule-in threshold of 0.55) would lead to a NNS of 70-80 in the general population, or a NNS of 26 in the PAR population, with higher or lower NNS according to the desired detection rate (Sup-plementary Table 19, 21). In contrast to this, guideline-adhering screening programs could have only detected 1 in 5 HCC cases even in the unlikely scenario of perfect screening at-tendance and perfect detection rate (Figure 2d). In clinical routine, estimation of annual risk bears greater clinical relevance compared to all-time risk of HCC. Our study lacks models of individual changes in risk of HCC due to lack of trajectory data. In future work, coupling time-to-event analysis with ML-decision trees such as random survival forests, could be beneficial 57. Our analysis, like all population-based studies, relies on accurate encoding of EHR. Notably, evaluating our models exclusively on cases with HCC encoding both in NHS and national cancer registries, opposed to NHS-only, revealed a strongly reduced number of false negatives as opposed to evaluation on all EHR-encoded cases. This suggests that those cases, initially labeled as FN, were actually TN and the model was correct when interpreting low-risk phenotypes as negative. Still, a fraction of approxi-mately 5-10 % of HCC cases were wrongly assigned to the low-risk group, for UKB and AOU test sets alike. Analysis of these misclassified cases revealed notable insights. Consistently, our classifiers demonstrated a higher tendency to overlook HCC in female patients compared to male patients. This could be explained by an overall less prominent HCC phenotype in females or the higher prevalence for males, challenging the assumption that a combined model for all sexes is actually feasible. Interestingly, this performance gap was less-pro-nounced in the All Of Us cohort, with furthermore no differences in performance for the “White” ethnicity vs other ethnicities, suggesting good generalizability. However, validation in other populations, especially Asian populations with higher prevalence of viral hepatitis, is still pend-ing. There are numerous obstacles on the path to clinical implementation of ML risk models for HCC prediction. Firstly, prospective clinical trials will be needed before clinical adoption can be recommended. Setting those up isolated for HCC-screening will be challenging due to its low incidence. We show that in most cases, steatotic liver diseases are only diagnosed after diagnosis of HCC. This suggests that current standard practice, which links HCC screening to prior diagnosis of liver cirrhosis, is likely too narrowly defined, and that accurate, non-invasive biomarkers are needed both for prediction of HCC and for prediction of cirrhotic liver disease. It could therefore be beneficial to set up multi-purpose trials that screen for liver morbidity in general and HCC alike, as e.g. the LiverAIM trial aims to do 58. Current clinical routine de-mands an incidence of 1.5 % in the screening group for HCC screening to become economi-cally feasible 59, and a minimal annual incidence of 0.2 % in the respective population to allow for effective screening 7,48. Patients in our model’s high-risk group clearly surpass the annual incidence of 0.2 % in both cohorts, even despite removal of HCC cases near baseline, removal of cases not confirmed by cancer registries, and a lower general incidence of HCC in the UKB than in the general population. While thresholds would have to be adapted to surpass the limit of 1.5 % incidence, this limit might change in favor of broader screening strategies, as treat-ment with expensive immune checkpoint inhibition proves successful also in earlier stages, increasing therapy cost extensively 60. From a technical perspective, our models can be integrated effortlessly for further research purposes and, once a medical device, in clinical practice. The simple random forest architec-ture with as few as 15 affordable and routinely available parameters can be validated on in-dependent data with the python package we provide. Further, for research-purposes only, it can be accessed for individual inference via a web-tool we provide at [https://hugging-face.co/spaces/schneiderlab/ML-HCC](https://hugging-face.co/spaces/schneiderlab/ML-HCC), which also agentic workflows using large-language-models could access in the future 61. This could enable inference of a multitude of prediction models at once compared to tedious manual variable entering, as the latter will naturally limit clinical usage 61. Finally, long-term integration should ideally happen interoperable with hospi-tal-information-systems, where input data can be automatically inferred for a multitude of mod-els, as machine-learning predictions enter clinical routine in virtually every discipline of healthcare 62,63. Our study reveals critical gaps in the current screening process for HCC. It addresses a clini-cally highly relevant question in an interpretable way with actionable output. Our model out-performs all established linear risk calculators for HCC by leveraging interpretable machine learning and extensive population-based data. As risk calculators are a routinely used practice especially in hepatology, e.g. for estimation of liver-related mortality 64, its implementation into clinical routine could come naturally, especially given its reliance on simple, ubiquitously avail-able parameters. We are confident that, after necessary prospective evaluation, our model can therefore contribute as a decision-aid for clinicians, integrating risk factors for a personal-ized assessment of risk, allowing earlier intervention and potentially curative treatment. ## Supporting information Supplementary Material (except Tables) [[supplements/316662_file02.pdf]](pending:yes) Supplementary Tables 1-25 [[supplements/316662_file03.xlsx]](pending:yes) ## Data Availability UK Biobank data, including NMR metabolomics, are publicly available to bona fide researchers upon application at [http://www.ukbiobank.ac.uk/using-the-resource/](http://www.ukbiobank.ac.uk/using-the-resource/). Detailed information on predictors and endpoints used in this study is presented in Supplementary Tables 1-25. This study used data from the All of Us Research Program's Controlled Tier Dataset v7, available to authorized users on the Researcher Workbench. [http://www.ukbiobank.ac.uk/using-the-resource/](http://www.ukbiobank.ac.uk/using-the-resource/) [https://allofus.nih.gov/](https://allofus.nih.gov/) ## Additional Information ## Competing interests JNK declares consulting services for Bioptimus, France; Owkin, France; DoMore Diagnostics, Norway; Panakeia, UK; AstraZeneca, UK; Mindpeak, Germany; and MultiplexDx, Slovakia. Furthermore, he holds shares in StratifAI GmbH, Germany, Synagen GmbH, Germany, and has received a research grant by GSK, and has received hon-oraria by AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer and Fresenius. TB has served on advisory boards for AdvanzPharma/Intercept Pharmaceuticals, SOBI, Novartis, and Gilead, and has received speaker fees from Falk Foundation, CSL Behring, Norgine, Intercept, Abbvie, Gilead, Merck, and Gore. OSMEN holds shares in StratifAI GmbH, Germany. Apichat Kaewdech received research grants or support from Roche, Roche Diagnostics, and Abbott Laboratories, and honoraria from Roche, Roche Diagnostics, Abbott Laboratories, and Esai. ## Financial support JC is supported by the Mildred-Scheel-Postdoktorandenprogramm of the German Cancer Aid (grant #70115730). JNK is supported by the German Federal Ministry of Health (DEEP LIVER, ZMVI1-2520DAT111), the Max-Eder-Programme of the German Cancer Aid (grant #70113864), the German Federal Ministry of Education and Research (PEARL, 01KD2104C; CAMINO, 01EO2101; SWAG, 01KD2215A; TRANSFORM LIVER, 031L0312A; TANGE-RINE, 01KT2302 through ERA-NET Transcan), the German Academic Exchange Service (SECAI, 57616814), the German Federal Joint Committee (Transplant.KI, 01VSF21048) the European Union’s Horizon Europe and inno-vation programme (ODELIA, 101057091; GENIAL, 101096312) and the National Institute for Health and Care Re-search (NIHR, NIHR213331) Leeds Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. DT is supported by the German Federal Ministry of Education and Research (SWAG, 01KD2215A; TRANSFORM LIVER), the Euro-pean Union’s Horizon Europe and innovation programme (ODELIA, 101057091). TL was funded by the German Cancer Aid (Deutsche Krebshilfe -DECADE 70115166), the Federal Ministry of Education and Research (BMBF -TRANSFORM LIVER 031L0312B) and the Federal Ministry of Health (BMG - DEEP LIVER 2520DAT111). TB is supported by the German Research Foundation (SFB1382 Project ID 403224013/B07). C.V.S is supported by a grant from the Interdisciplinary Centre for Clinical Research within the faculty of Medicine at the RWTH Aachen University (PTD 1-13/IA 532313), the Junior Principal Investigator Fellowship program of RWTH Aachen Excel- lence strategy and the NRW Rueckkehr Programme of the Ministry of Culture and Science of the German State of North Rhine-Westphalia. K.M.S is supported by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the German State of North Rhine-Westphalia under the Excellence strategy of the federal government and the Laender as well as the NRW Rueckkehr Programme of the Ministry of Culture and Science of the German State of North Rhine-Westphalia. C.V.S and K.M.S are supported by the CRC 1382 project A11 and B09 funded by Deutsche Forschungsgesellschaft (DFG, German Research Foundation) – Project-ID 403224013 – SFB 1382“. D.Y.Z. is supported by the National Heart, Lung, and Blood Institute of the National Institute of Health under award number F30HL172382. ## Author contributions JC, JNK and CVS conceptualized the study and wrote the manuscript. JC and FvH executed the preprocessing. JC and PHK built the modeling pipeline. JC and DZ executed the external validation. PHK and TS built the web-application. OSMEN, LZ, JuC, FT, TL, TB, AK, DT and KMS provided invaluable feedback and revised the manu-script. ## Data availability UK Biobank data, including NMR metabolomics, are publicly available to bona fide researchers upon application at [http://www.ukbiobank.ac.uk/using-the-resource/](http://www.ukbiobank.ac.uk/using-the-resource/). Detailed information on predictors and endpoints used in this study is presented in Supplementary Tables 1-25. This study used data from the All of Us Research Program’s Controlled Tier Dataset v7, available to authorized users on the Researcher Workbench. ## Code availability All code developed and used throughout this study has been made open source and is available on GitHub. The code used for preprocessing can be found here: Code for model development can be found here: [https://github.com/schneiderlabac/hcc\_u\_soon](https://github.com/schneiderlabac/hcc_u_soon) ## Declaration of generative AI and AI-assisted technologies in the writing process During the preparation of this work the authors used GPT-4o (OpenAI) and Claude 3.5 Sonnet (Anthropic) in order to correct spelling and grammar and for coding assistance, in accordance with the COPE (Committee on Publica-tion Ethics) position statement of 13 February 2023 67. After using this tool/service, the authors reviewed and edited ## Ethics disclaimer We acknowledge the ethical complexities of categorizing individuals in medical research, especially when catego-rizing by ethnicity. While any form of categorization risks perpetuating discrimination, the complete omission of such considerations in medical AI could paradoxically reinforce healthcare disparities. Machine learning models are sensitive to training data composition and can silently perpetuate or amplify existing biases if their performance across different populations remains untested. By explicitly examining model performance across ethnic groups, we aim to ensure equitable predictive accuracy and identify potential disparities in model generalizability that re-quire attention. This approach aligns with the broader goal of developing inclusive healthcare solutions while re-maining mindful of the need to handle demographic data with appropriate sensitivity and scientific rigor. Our cate-gorizations follow standardized reporting guidelines while recognizing that such classifications are social constructs and cannot fully capture human diversity. Our study does not include confidential information. All research proce-dures were conducted exclusively on anonymized patient data and in accordance with the Declaration of Helsinki, maintaining all relevant ethical standards. ## Acknowledgements This research has been conducted using the UK Biobank Resource under Application Number 71300. UK biobank data was accessed by J.N.C, C.V.S and K.M.S. Copyright © 2024, NHS England. Re-used with the permission of the NHS England and/or UK Biobank. All rights reserved. This work uses data provided by patients and collected by the NHS as part of their care and support. All UKB analyses have been performed by JC and PHK. The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional Medical Centers (OT2 OD026549; OT2 OD026554; OT2 OD026557; OT2 OD026556; OT2 OD026550; OT2 OD 026552; OT2 OD026553; OT2 OD026548; OT2 OD026551; OT2 OD026555); Inter agency agreement AOD 16037; Feder-ally Qualified Health Centers HHSN 263201600085U; Data and Research Center: U2C OD023196; Genome Cen-ters (OT2 OD002748; OT2 OD002750; OT2 OD002751); Biobank: U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: U24 OD023163; Communications and Engagement: OT2 OD023205; OT2 OD023206; and Community Partners (OT2 OD025277; OT2 OD025315; OT2 OD025337; OT2 OD025276). We gratefully acknowledge All of Us participants for their contributions, without whom this research would not have been possible. We also thank the National Institutes of Health’s All of Us Research Program for making available the participant data examined in this study. ## Abbreviations AASLD : American Association for the Study of Liver Diseases ALT : Alanine Aminotransferase AOU : All Of Us Research Program AST : Aspartate Aminotransferase AUPRC : Area Under the Precision-Recall Curve AUROC : Area Under the Receiver Operating Characteristic Curve BMI : Body Mass Index CI : Confidence Interval CLD : Chronic Liver Disease COPE : Committee on Publication Ethics EASL : European Association for the Study of the Liver EHR : Electronic Health Records FDR : False Discovery Rate FN : False Negative FP : False Positive γ-GT : Gamma Glutamyltransferase HCC : Hepatocellular Carcinoma ICD : International Classification of Diseases IGF-1 : Insulin-like Growth Factor 1 MASLD : Metabolic Dysfunction-Associated Steatotic Liver Disease ML : Machine Learning NMR : Nuclear Magnetic Resonance NNS : Number Needed to Screen NPV : Negative Predictive Value OMOP : Observational Medical Outcomes Partnership PAR : Patients at Risk PPV : Positive Predictive Value PRC : Precision-Recall Curve PRS : Polygenic Risk Score RFC : Random Forest Classifier ROC : Receiver Operating Characteristic SD : Standard Deviation SHAP : SHapley Additive exPlanations SNP : Single Nucleotide Polymorphism TN : True Negative TP : True Positive TRIPOD : Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis UKB : UK Biobank XGB : Extreme Gradient Boosting * Received November 3, 2024. * Revision received November 3, 2024. * Accepted November 4, 2024. * © 2024, Posted by Cold Spring Harbor Laboratory The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission. ## References 1. 1.Ioannou, G. N. Epidemiology and risk-stratification of NAFLD-associated HCC. J. Hepatol. 75, 1476–1484 (2021). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jhep.2021.08.012&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=34453963&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 2. 2.Jembere, N. et al. Influence of Socioeconomic Status on Survival of Hepatocellular Carcinoma in the Ontario Population; A Population-Based Study, 1990–2009. PLoS One 7, e40917 (2012). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pone.0040917&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22808283&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 3. 3.Suddle, A. et al. British Society of Gastroenterology guidelines for the management of hepatocellular carcinoma in adults. Gut gutjnl–2023–331695 (2024). 4. 4.Forner, A., Reig, M. & Bruix, J. Hepatocellular carcinoma. Lancet 391, 1301–1314 (2018). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0140-6736(18)30010-2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29307467&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 5. 5.Bianco, C. et al. Non-invasive stratification of hepatocellular carcinoma risk in non-alcoholic fatty liver using polygenic risk scores. J. Hepatol. 74, 775–782 (2021). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jhep.2020.11.024&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=33248170&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 6. 6.Buch, S. et al. A genome-wide association study confirms PNPLA3 and identifies TM6SF2 and MBOAT7 as risk loci for alcohol-related cirrhosis. Nat. Genet. 47, 1443– 1448 (2015). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.3417&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26482880&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 7. 7.Galle, P. R. et al. EASL Clinical Practice Guidelines: Management of hepatocellular carcinoma. J. Hepatol. 69, 182–236 (2018). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jhep.2018.03.019&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29628281&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 8. 8.Nahon, P. et al. Integrating genetic variants into clinical models for hepatocellular carcinoma risk stratification in cirrhosis. J. Hepatol. 78, 584–595 (2023). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36427656&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 9. 9.Kondili, L. A. et al. Inequities in primary liver cancer in Europe: The State of Play. J. Hepatol. (2024) doi:10.1016/j.jhep.2023.12.031. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jhep.2023.12.031&link_type=DOI) 10. 10.Kim, N. J., et al. Addressing racial and ethnic disparities in US liver cancer care. Hepatol. Commun. 7, e00190 (2023). 11. 11.An, C. et al. Prediction of the risk of developing hepatocellular carcinoma in health screening examinees: a Korean cohort study. BMC Cancer 21, 755 (2021). 12. 12.Singal, A. G. et al. Machine learning algorithms outperform conventional re-gression models in predicting development of hepatocellular carcinoma. Am. J. Gastro-enterol. 108, 1723–1730 (2013). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ajg.2013.332&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24169273&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 13. 13.Lee, H. W. et al. A machine learning model for predicting hepatocellular carci-noma risk in patients with chronic hepatitis B. Liver Int. 43, 1813–1821 (2023). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37452503&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 14. 14.Sarkar, S. et al. A machine learning model to predict risk for hepatocellular carcinoma in patients with metabolic dysfunction-associated steatotic liver disease. Gastro Hep Adv. 3, 498–505 (2024). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=39131709&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 15. 15.Xu, Y. et al. Development of machine learning-based personalized predictive models for risk evaluation of hepatocellular carcinoma in hepatitis B virus-related cirrho-sis patients with low levels of serum alpha-fetoprotein. Ann. Hepatol. 101540 (2024). 16. 16.Serra-Burriel, M. et al. Development, validation, and prognostic evaluation of a risk score for long-term liver-related outcomes in the general population: a multicohort study. Lancet 402, 988–996 (2023). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0140-6736(23)01174-1&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37572680&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 17. 17.Liu, Z. et al. Point-based risk score for the risk stratification and prediction of hepatocellular carcinoma: a population-based random survival forest modeling study. EClinicalMedicine 75, 102796 (2024). 18. 18.Calderaro, J., Seraphin, T. P., Luedde, T. & Simon, T. G. Artificial intelligence for the prevention and clinical management of hepatocellular carcinoma. J. Hepatol. 76, 1348–1361 (2022). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35589255&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 19. 19.Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pmed.1001779&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25826379&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 20. 20.Bragg, F. et al. Predictive value of circulating NMR metabolic biomarkers for type 2 diabetes risk in the UK Biobank study. BMC Med. 20, 159 (2022). 21. 21.Sveinbjornsson, G. et al. Multiomics study of nonalcoholic fatty liver disease. Nat. Genet. 54, 1652–1663 (2022). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-022-01199-5&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36280732&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 22. 22.Bycroft, C. et al. The UK Biobank resource with deep phenotyping and ge-nomic data. Nature 562, 203–209 (2018). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-018-0579-z&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30305743&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 23. 23.Zeng, C. et al. Comparison of phenomic profiles in the All of Us Research Program against the US general population and the UK Biobank. J. Am. Med. Inform. Assoc. (2024) doi:10.1093/jamia/ocad260. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/jamia/ocad260&link_type=DOI) 24. 24.All of Us Research Program Investigators et al. The ‘All of Us’ Research Pro-gram. N. Engl. J. Med. 381, 668–676 (2019). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMsr1809937&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31412182&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 25. 25.Burton, A. et al. Primary liver cancer in the UK: Incidence, incidence-based mortality, and survival by subtype, sex, and nation. JHEP Rep. 3, 100232 (2021). 26. 26.Lange, N. F., Radu, P. & Dufour, J.-F. Prevention of NAFLD-associated HCC: Role of lifestyle and chemoprevention. J. Hepatol. 75, 1217–1227 (2021). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=34339764&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 27. 27.Abel, G. A., Barclay, M. E. & Payne, R. A. Adjusted indices of multiple depri-vation to enable comparisons within and between constituent countries of the UK includ-ing an illustration using mortality rates. BMJ Open 6, e012750 (2016). [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoiYm1qb3BlbiI7czo1OiJyZXNpZCI7czoxMjoiNi8xMS9lMDEyNzUwIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjQvMTEvMDQvMjAyNC4xMS4wMy4yNDMxNjY2Mi5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 28. 28.Trépo, E. et al. Common genetic variation in alcohol-related hepatocellular carcinoma: a case-control genome-wide association study. Lancet Oncol. 23, 161–171 (2022). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S1470-2045(21)00603-3&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=34902334&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 29. 29.Innes, H. et al. The rs429358 locus in apolipoprotein E is associated with hepatocellular carcinoma in patients with cirrhosis. Hepatol. Commun. 6, 1213–1226 (2022). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=34958182&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 30. 30.Buch, S. et al. Genetic variation in TERT modifies the risk of hepatocellular carcinoma in alcohol-related cirrhosis: results from a genome-wide case-control study. Gut 72, 381–391 (2023). [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NjoiZ3V0am5sIjtzOjU6InJlc2lkIjtzOjg6IjcyLzIvMzgxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjQvMTEvMDQvMjAyNC4xMS4wMy4yNDMxNjY2Mi5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 31. 31.Li, S. et al. GWAS identifies novel susceptibility loci on 6p21.32 and 21q21.3 for hepatocellular carcinoma in chronic hepatitis B virus carriers. PLoS Genet. 8, e1002791 (2012). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pgen.1002791&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22807686&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 32. 32.Chambers, J. C. et al. Genome-wide association study identifies loci influenc-ing concentrations of liver enzymes in plasma. Nat. Genet. 43, 1131–1138 (2011). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.970&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22001757&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 33. 33.Romeo, S. et al. Genetic variation in PNPLA3 confers susceptibility to nonal-coholic fatty liver disease. Nat. Genet. 40, 1461–1465 (2008). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.257&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=18820647&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000261215900023&link_type=ISI) 34. 34.Innes, H. et al. Performance of routine risk scores for predicting cirrhosis-re-lated morbidity in the community. J. Hepatol. 77, 365–376 (2022). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35271950&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 35. 35.Zhou, H. et al. Multi-ancestry study of the genetics of problematic alcohol use in over 1 million individuals. Nat. Med. 29, 3184–3192 (2023). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=38062264&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 36. 36.Malhotra, R. & Singh, P. Recent advances in deep learning models: a sys-tematic literature review. Multimed. Tools Appl. 82, 44977–45060 (2023). 37. 37.Shwartz-Ziv, R. & Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 81, 84–90 (2022). 38. 38.Fan, R. et al. aMAP risk score predicts hepatocellular carcinoma development in patients with chronic hepatitis. J. Hepatol. 73, 1368–1378 (2020). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32707225&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 39. 39.Sterling, R. K. et al. Development of a simple noninvasive index to predict sig-nificant fibrosis in patients with HIV/HCV coinfection. Hepatology 43, 1317–1325 (2006). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/hep.21178&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=16729309&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000237984500018&link_type=ISI) 40. 40.Loaeza-del-Castillo, A., Paz-Pineda, F., Oviedo-Cárdenas, E., Sánchez-Avila, F. & Vargas-Vorácková, F. AST to platelet ratio index (APRI) for the noninvasive evalua-tion of liver fibrosis. Ann. Hepatol. 7, 350–357 (2008). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19034235&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 41. 41.Treeprasertsuk, S., Björnsson, E., Enders, F., Suwanwalaikorn, S. & Lindor, K. D. NAFLD fibrosis score: a prognostic predictor for mortality and liver complications among NAFLD patients. World J. Gastroenterol. 19, 1219–1229 (2013). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23482703&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 42. 42.Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. arXiv [cs.LG*]* 2825–2830 (2012). 43. 43.Abboud, Y. et al. Hepatocellular Carcinoma Incidence and Mortality in the USA by Sex, Age, and Race: A Nationwide Analysis of Two Decades. Journal of clinical and translational hepatology 12, (2024). 44. 44.Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. Transparent report-ing of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement. BMC Med. 13, 1 (2015). 45. 45.Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861– 874 (2006). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.patrec.2005.10.010&link_type=DOI) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000237462800002&link_type=ISI) 46. 46.Mosquera, C., Ferrer, L., Milone, D. H., Luna, D. & Ferrante, E. Class imbal-ance on medical image classification: towards better evaluation practices for discrimina-tion and calibration performance. Eur. Radiol. (2024) doi:10.1007/s00330-024-10834-0. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s00330-024-10834-0&link_type=DOI) 47. 47.Žigutytė, L., Sorz-Nechay, T., Clusmann, J. & Kather, J. N. Use of artificial in-telligence for liver diseases: A survey from the EASL congress 2024. JHEP Rep. 101209 (2024). 48. 48.Singal, A. G. et al. AASLD Practice Guidance on prevention, diagnosis, and treatment of hepatocellular carcinoma. Hepatology 78, 1922–1965 (2023). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1097/HEP.0000000000000466&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37199193&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 49. 49.Jung, A. W. et al. Multi-cancer risk stratification based on national health data: a retrospective modelling and validation study. Lancet Digit Health 6, e396–e406 (2024). 50. 50.Placido, D. et al. A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories. Nat. Med. 29, 1113–1122 (2023). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41591-023-02332-5&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37156936&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 51. 51.Turnbull, C. et al. Population screening requires robust evidence-genomics is no exception. Lancet 403, 583–586 (2024). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0140-6736(23)02295-X&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=38070525&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 52. 52.Singal, A. G. et al. International Liver Cancer Association (ILCA) white paper on hepatocellular carcinoma risk stratification and surveillance. J. Hepatol. 79, 226–239 (2023). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36854345&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 53. 53.Lundberg, S. M. et al. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat Mach Intell 2, 56–67 (2020). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/S42256-019-0138-9&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32607472&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 54. 54.Garg, M. et al. Disease prediction with multi-omics and biomarkers empowers case-control genetic discoveries in the UK Biobank. Nat. Genet. 56, 1821–1831 (2024). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-024-01898-1&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=39261665&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 55. 55.Davis, C. G., Thake, J. & Vilhena, N. Social desirability biases in self-reported alcohol consumption and harms. Addict. Behav. 35, 302–311 (2010). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.addbeh.2009.11.001&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19932936&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000275135900002&link_type=ISI) 56. 56.Minami, T. et al. Machine learning for individualized prediction of hepatocellu-lar carcinoma development after the eradication of hepatitis C virus with antivirals. J. Hepatol. (2023) doi:10.1016/j.jhep.2023.05.042. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jhep.2023.05.042&link_type=DOI) 57. 57.Pickett, K. L., Suresh, K., Campbell, K. R., Davis, S. & Juarez-Colunga, E. Random survival forests for dynamic predictions of a time-to-event outcome using a lon-gitudinal biomarker. BMC Med. Res. Methodol. 21, 216 (2021). 58. 58.Thiele, M., Pose, E., Juanola, A., Mellinger, J. & Ginès, P. Population screen-ing for cirrhosis. Hepatol. Commun. 8, e0512 (2024). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=39185917&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 59. 59.Innes, H. & Nahon, P. Statistical perspectives on using hepatocellular carci-noma risk models to inform surveillance decisions. J. Hepatol. 79, 1332–1337 (2023). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37210001&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 60. 60.Qin, S. et al. Atezolizumab plus bevacizumab versus active surveillance in patients with resected or ablated high-risk hepatocellular carcinoma (IMbrave050): a randomised, open-label, multicentre, phase 3 trial. Lancet 402, 1835–1847 (2023). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37871608&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 61. 61.Ferber, D., et al. Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology. arXiv [cs.AI] (2024). 62. 62.Lin, C.-S. et al. AI-enabled electrocardiography alert intervention and all-cause mortality: a pragmatic randomized clinical trial. Nat. Med. 30, 1461–1470 (2024). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=38684860&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 63. 63.Mandl, K. D., Gottlieb, D. & Mandel, J. C. Integration of AI in healthcare re-quires an interoperable digital data ecosystem. Nat. Med. 1–4 (2024). 64. 64.Kim, W. R. et al. MELD 3.0: The model for end-Stage Liver Disease updated for the modern era. Gastroenterology 161, 1887–1895.e4 (2021). [PubMed](http://medrxiv.org/lookup/external-ref?access_num=34481845&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F11%2F04%2F2024.11.03.24316662.atom) 65. 65.Sjoberg, D., Whiting, K., Curry, M., Lavery, J. & Larmarange, J. Reproducible Summary Tables with the gtsummary Package. R J. 13, 570 (2021). 66. 66.Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predic-tions. Advances in Neural Information Processing Systems 30, (2017). 67. 67.Authorship and AI tools. *COPE: Committee on Publication Ethics* [https://publicationethics.org/cope-position-statements/ai-author](https://publicationethics.org/cope-position-statements/ai-author).