PT  - JOURNAL ARTICLE
AU  - Muhr, Paula
AU  - Pan, Yating
AU  - Tumescheit, Charlotte
AU  - Kübler, Ann-Kathrin
AU  - Parmaksiz, Hatice Kübra
AU  - Chen, Cheng
AU  - Bolaños Orozco, Pablo Sebastián
AU  - Lienkamp, Soeren S.
AU  - Hastings, Janna
TI  - Evaluating Text-to-Image Generated Photorealistic Images of Human Anatomy
AID  - 10.1101/2024.08.21.24312353
DP  - 2024 Aug 21
TA  - medRxiv
PG  - 2024.08.21.24312353
4099  - http://medrxiv.org/content/early/2024/08/21/2024.08.21.24312353.short
4100  - http://medrxiv.org/content/early/2024/08/21/2024.08.21.24312353.full
AB  - Background: Generative AI models that can produce photorealistic images from text descriptions have many applications in medicine, including medical education and synthetic data. However, it can be challenging to evaluate and compare their range of heterogeneous outputs, and thus there is a need for a systematic approach enabling image and model comparisons. Methods: We develop an error classification system for annotating errors in AI-generated photorealistic images of humans and apply our method to a corpus of 240 images generated with three different models (DALL-E 3, Stable Diffusion XL and Stable Cascade) using 10 prompts with 8 images per prompt for each model. The error classification system identifies five different error types with three different severities across five anatomical regions and specifies an associated quantitative scoring method based on aggregated proportions of errors per expected count of anatomical components for the generated image. We assess inter-rater agreement by double-annotating 25% of the images and calculating Krippendorff’s alpha, and compare results across the three models and ten prompts quantitatively using a cumulative score per image. Findings: The error classification system, accompanying training manual, generated image collection, annotations, and all associated scripts are available from our GitHub repository at https://github.com/hastingslab-org/ai-human-images. Inter-rater agreement was relatively poor, reflecting the subjectivity of the error classification task. Model comparisons revealed that DALL-E 3 performed consistently better than Stable Diffusion; however, the latter generated images reflecting more diversity in personal attributes. 
Images with groups of people were more challenging for all the models than individuals or pairs; some prompts were challenging for all models. Interpretation: Our method enables systematic comparison of AI-generated photorealistic images of humans; our results can serve to catalyse improvements in these models for medical applications. Funding: This study received support from the University of Zurich’s Digital Society Initiative, and the Swiss National Science Foundation under grant agreement 209510. Evidence before this study: The authors searched PubMed and Google Scholar to find publications evaluating text-to-image model outputs for medical applications between 2014 (when generative adversarial networks first became available) and 2024. While the bulk of evaluations focused on task-specific networks generating single types of medical image, a few evaluations emerged exploring the novel general-purpose text-to-image diffusion models more broadly for applications in medical education and synthetic data generation. However, no previous work has attempted to develop a systematic approach to evaluate these models’ representations of human anatomy. Added value of this study: We present an anatomical error classification system, the first systematic approach to evaluate AI-generated images of humans that enables model and prompt comparisons. We apply our method to a corpus of generated images to compare the state-of-the-art large-scale models DALL-E 3 and two models from the Stable Diffusion family. Implications of all the available evidence: While our approach enables systematic comparisons, it remains limited by subjectivity and is labour-intensive for images with many represented figures. 
Future research should explore automation of some aspects of the evaluation through coupled segmentation and classification models. Competing Interest Statement: The authors have declared no competing interest. All data provided in the manuscript are available online at https://github.com/hastingslab-org/ai-human-images.