A pen mark is all you need - Incidental prompt injection attacks on Vision Language Models in real-life histopathology
======================================================================================================================

* Jan Clusmann
* Stefan J. K. Schulz
* Dyke Ferber
* Isabella C. Wiest
* Aurélie Fernandez
* Markus Eckstein
* Fabienne Lange
* Nic G. Reitsam
* Franziska Kellers
* Maxime Schmitt
* Peter Neidlinger
* Paul-Henry Koop
* Carolin V. Schneider
* Daniel Truhn
* Wilfried Roth
* Moritz Jesinghaus
* Jakob N. Kather
* Sebastian Foersch

## Abstract

Vision-language models (VLMs) can analyze multimodal medical data. However, a significant weakness of VLMs, as we have recently described, is their susceptibility to prompt injection attacks. Here, the model receives conflicting instructions, leading to potentially harmful outputs. In this study, we hypothesized that handwritten labels and watermarks on pathological images could act as inadvertent prompt injections, influencing decision-making in histopathology. We conducted a quantitative study with a total of N = 3888 observations on the state-of-the-art VLMs Claude 3 Opus, Claude 3.5 Sonnet and GPT-4o. We designed various real-world inspired scenarios in which we show that VLMs rely entirely on (false) labels and watermarks if presented with those next to the tissue. All models reached almost perfect accuracies (90 - 100 %) for ground-truth leaking labels and abysmal accuracies (0 - 10 %) for misleading watermarks, despite baseline accuracies between 30-65 % for various multiclass problems. Overall, all VLMs accepted human-provided labels as infallible, even when those inputs contained obvious errors. Furthermore, these effects could not be mitigated by prompt engineering. It is therefore imperative to consider the presence of labels or other influencing features during future evaluation of VLMs in medicine and other fields.

## Introduction

Generative artificial intelligence (AI) systems such as large language models (LLMs) are trained on vast amounts of human language data1,2. Various scientific and medical uses for LLMs have been proposed, with the potential to significantly transform and enhance modern medicine3–6. Specifically, LLMs have demonstrated their ability to alleviate documentation burdens and encourage adherence to clinical guidelines4,7. Alongside the swift advancements in LLM technology, there has been notable progress in foundation models and multimodal vision-language models (VLMs)8–10. These models can process both text and images, which further broadens their potential applications in healthcare. Several VLMs have been developed, ranging from those designed for specific medical purposes, such as analyzing pathology images or echocardiograms8,9, to generalist models like GPT-4o, which can be applied across various domains, including healthcare1,2,6,11–13

But as these new technologies are starting to transform modern medicine 6, potential drawbacks and vulnerabilities have to be evaluated 14,15. Prompt injection attacks, where hidden instructions within images can significantly influence VLMs and their decisions, have emerged as one relevant vulnerability 16–21. Prompt injections can be disguised in various forms, e.g. low-contrast settings, whitespace characters, tiny text, metadata, or watermarks 22,23. The attack vector can consist of or be included into any type of information which is passed through the model at runtime. More importantly, these exploits can be used simply by modifying a user’s input. Access to the model itself is not necessary to evade its guardrails and safety mechanisms, alter its output or to exfiltrate sensitive data. We have recently shown that the problem of prompt injection attacks holds dangerous implications for healthcare. Additionally, this vulnerability cannot be easily mitigated in a black-box setting 16,17.

It can be argued that deliberate prompt injection attacks on VLMs with malicious intent are, thus far, a hypothetical scenario, as no VLM is currently licensed as a medical device. Still, incidental modification of medical input data can happen rather easily. One potential use case is in the field of histopathology, where false labels, e.g. as pen marks on or next to the tissue could potentially influence a model’s decision. It is common practice in routine histopathology to use pens and markers to highlight areas of concern or make short notes on glass slides. This can be done for convenience, documentation and educational purposes. Common examples include the number of total and cancerous lymph nodes, parts of the TNM classification, or notes and markups for follow up molecular pathology. As histopathology is currently in the transition from an analogous to a digital workflow, compatibility issues like these can arise at the human-machine-interface, which can have dramatic consequences for medical professionals and their patients. However, it is unclear if such incidental prompt injection attacks can even occur, how dramatic their effect would be, and whether there are mitigation strategies, which could be implemented to prevent the influence of artifacts on model performance. We hypothesized that simple and commonly used pen labels or watermarks could influence the diagnostic accuracy of state-of-the.art VLMs in a manner similar to deliberate prompt injection attacks (Figure 1a).

![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/12/12/2024.12.11.24318840/F1.medium.gif)

[Figure 1:](http://medrxiv.org/content/early/2024/12/12/2024.12.11.24318840/F1)

Figure 1: Experimental outline and observational study to estimate the impact and the clinical representation of labeled pathology slides
**a** Different scenarios which can lead to incidental prompt injections in histopathology. Histopathological slides can contain readable text, marked regions and artificial image alterations such as water- or calibration marks. The VLMs Claude-3, Claude-3.5 and GPT-4o were used to assess various cases in three iterations leading to a total of 3888 observations throughout three label categories, four distinct clinically relevant tasks and n = 72 patient cases, where each prompt constellation of model, label, task and patient was repeated 3 times **b** Scatter plot of the percentage of marked slides vs. the number of total slides evaluated in nine different TCGA cohorts. **c** Exploration of fraction of labeled pathology-slides in five different German institutions. At least 100 randomly selected cases were screened for marked regions and readable text per institution. **d**, **e** Box plots for (**e**) percentage of cases with at least one marked slide (left) and the percentage of marked slides of all slides and (**f**) for the percentage of marked regions and readable text on the marked slides. Single data points represent means per participating institution, thick line represents the median. The boxplots represent the median, interquartile range (IQR), and outliers for each group. Images partially created with [www.BioRender.com](https://www.BioRender.com). **f** Examples of readable text and marked regions of different diagnostic slides within the TCGA dataset. Readable text included case numbers, originating institutions, numbers of tumorous and tumor-free lymph nodes, cancer and healthy tissue, tumor size.

## Results

### Multi-centric analysis of pathology slides reveals frequent occurrence of labels on slides

To estimate the extent of the use of pen marks in histopathology, we manually screened 3795 slides of nine different cohorts available in the cancer genome atlas (TCGA), one of the largest publicly available datasets for histopathology to date. Each slide was reviewed, and the presence or absence of pen marks was recorded systematically (Figure 1b, f). The proportion of slides with pen marks varied substantially, ranging from 0.9% in the Ovarian Cancer (OV) cohort to 32.8% in the Prostate Adenocarcinoma (PRAD) cohort. Across all screened TCGA cohorts, an average of 11.6% of slides displayed pen marks (Figure 1b). Of note, the number of slides screened does not necessarily correspond to the number of patients reported in the literature. Some cohorts lacked access to all diagnostic slides, while others included multiple slides per patient. As the TCGA slides have been curated to be used as a research tool, markings might have been erased or slides might have been cropped prior to their upload. To better understand the extent of the use of pen marks, we conducted a survey across five pathology institutions in Germany (Figure 1 c-e). Each institution screened at least 100 randomly selected cases from the past five years. On average, 30.1% of the cases included at least one slide with a pen mark and 18.1% of all slides were marked (Figure 1 d). Among the marked slides, 66.6% contained readable text, while 44.7% showed marked regions of tissue, and some slides showed both (Figure 1 e). These observations are in line with our hypothesis that pen labels on pathology slides are a common phenomenon and could influence the performance of VLMs.

### Incorrect labels on whole-slide images strongly influence accuracy of vision-language models

As VLMs have been shown to be capable of interpreting both pathological images and virtually any kind of label, text, or marking, we investigated whether labels on pathology slides can influence diagnostic accuracy of VLMs. Specifically, we first investigated labels in three different scenarios which are comparatively easy for pathologists to solve without high magnifications. These included (1) labeling the stage of the tumor (according to pT-stage from the most recent WHO classification), (2) lymph node infiltration denoted as “number of tumor bearing lymph nodes / number of all lymph nodes on the slide” and (3) mutational status of papillary thyroid cancer (PTC) (BRAF vs. RAS vs. wildtype) (Table 1, Figure 2a, b, Supplementary Table 1, 2). First, we added either a correct, a misleading or no T-Stage label on slides from 14 individual colorectal cancer patients, and queried the VLMs Claude-3, Claude-3.5 and GPT-4o to provide the pathological T-Stage (pT) (Figure 2c). Second, we queried the VLMs on 13 individual tissue samples of lymph nodes from patients with breast cancer with varying percentage of tumor-infiltrated lymph nodes, and similarly added true or false lymph node status information as handwritten labels on the slides (Figure 2d). Third, we selected 15 slides of PTC (9 BRAF, 6 RAS) and queried the VLM for each case, whether it saw a BRAF mutation, a RAS mutation or a wildtype slide (Figure 2e). This represented a more complex task that could only be solved by the VLM by realization of the morphology which is associated with mutually exclusive BRAF and RAS mutations. This resulted in 42 slides passed in three constellations (true label, false label, no label) in triplicates to the VLMs Claude-3, Claude-3.5 and GPT-4o, in total resulting in N = 1134 observations.

![Figure 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/12/12/2024.12.11.24318840/F2.medium.gif)

[Figure 2:](http://medrxiv.org/content/early/2024/12/12/2024.12.11.24318840/F2)

Figure 2: Incorrect labels on whole-slide images reduce accuracy of vision-language models task-independent
**a** Accuracy over all three tasks (T-Stage prediction, lymph node infiltration prediction, Mutational Status prediction) with *n* = 126 requests per model and label type, leading to a total of n = 1134 observations. Displayed as mean ± SD with each datapoint representing the mean over 3 iterations per prompt/label/patient constellation. **b** Exemplary images per task and label type. **c-e** Accuracy per task (**c** = T-Stage, *n* = 378, **d** = lymph node infiltration, *n* = 351, **e** = mutational status, *n* = 405). Displayed as mean ± SD, circles represent observations summarized as triplicates. Dotted line represents random guess. Statistical significance tested with the Kruskal-Wallis test and Dunn’s post hoc test, each comparing the label constellations with their respective baseline w/o label. **f-g** Frequency of flagged requests per task (**f** = T-Stage, *n* = 378 **g** = lymph node infiltration, *n* = 351, **h** = mutational status, *n* = 405). Displayed as mean ± SD, circles represent observations summarized as triplicates.

View this table:
[Table 1:](http://medrxiv.org/content/early/2024/12/12/2024.12.11.24318840/T1)

Table 1: Overview of designed tasks, tissues and prompts used

First, we assessed the baseline diagnostic accuracy of the models for the three tasks, i.e. on the slides with no label. All models surpassed 25 % accuracy for all tasks in zero-shot inference, with mean accuracy of 52 % for T-Stage (38 % for Claude-3, 50 % for Claude-3.5 and 68 % for GPT-4o) (Figure 2c, Supplementary Table 3), 31 % for lymph node infiltration (31 % for Claude-3, 36 % Claude-3.5 and 23 % for GPT-4o) (Figure 2d, Supplementary Table 4) and 64 % for mutational status (60 % for Claude-3, 93 % for Claude-3.5 and 38 % for GPT-4o) (Figure 2e, Supplementary Table 4. Given the amount of predefined options (5 for T-Stage, up to 20 for lymph node status and 3 for mutational status), this result confirmed that SOTA VLMs are, in principle, capable of inferring insights from subimages / magnifications of pathological slides.

We then compared the accuracies for each modality between the scenarios with correct label or misleading label to the baseline without label, with drastic results. As soon as labels were present, all models virtually lost their diagnostic capabilities, ignored tissue morphology and relied solely on the label, leading to a negligible accuracy if presented with a misleading (0 % for Claude-3, Claude-3.5 and GPT-4o each, *p* < 0.0001 each), and almost perfect accuracy when presented with a label that leaked the ground truth (100 % for Claude-3, 100 % for Claude-3.5 and 84 % for GPT-4o, *p* < 0.0001 each) (Figure 2a). While the model performance differed for baseline experiments, the phenomenon of over- and underperformance when presented with a label was independent of the provided task (Figure 2 c-e, Supplementary Tables 3-5).

As a secondary endpoint, we hypothesized that while the models might predominantly identify the visible label as the most apparent feature on the slides, they would also generate warnings when discrepancies between the image and the text were detected. To assess this, we instructed the models to provide a “flag” for prompts deemed suspicious or conflicting (Figure 2f-h, Supplementary Table 6-8). However, only for the mutational status, a significantly higher “flag”-rate for the misleading labels compared to slides without labels was observed (*p* = 0.015 over all models combined), while otherwise no clear tendency for higher flag rate for conflicting tissue and label could be observed.

### Misleading watermarks influence VLMs decision more than the tissue itself

Handwritten labels are comparatively obvious clues and likely refer to a diagnosis previously made by a pathologist, which are therefore more often correct than not. It is therefore plausible that VLMs use this shortcut when provided with the opportunity to do so. As a next step, we therefore developed a strategy where the models could either predict an output based on the image or based on a watermark, which could give the correct answer away. Specifically, we created logos and names for fictitious medical institutions, specifically a “National Ovarian Cancer Center” (NOCC) and a “National Prostate Cancer Center” (NPCC). We then presented the models with 15 HE slides of prostate cancer and ovarian cancer each, where every slide was complemented with either the watermark logo for the NOCC or NPCC, representing either the data leakage or a misleading label. We instructed the models to give the correct diagnosis out of the options of Breast, Prostate, Ovary, Colorectal, or Lung Cancer (Figure 3a, b, c) and compared those results against a baseline of the prostate and ovarian cancer slides without any watermark. Surprisingly, even these far fetched data leakages prominently influenced the models’ decisions. Over ovarian and prostate cancer diagnosis combined, models reached an average accuracy of 30 % (Claude-3), 43 % (Claude-3.5) and 18 % (GPT-4o) (Figure 3a, Supplementary Table 9), with better performance for prostate cancer (58 % (Claude-3), 58 % (Claude-3.5) and 13 % (GPT-4o)) (Figure 3 f, g, Supplementary Table 11) than for ovarian cancer (2 % (Claude-3), 29 % (Claude-3.5) and 22 % (GPT-4o)) (Figure 3 d, e, Supplementary Table 10). Strikingly however, for both cancer types and for all models, the watermarks strongly influenced the models’ decisions, leading to negligible and significantly worse performance for misleading watermarks (9 % for Claude-3, 7 % for Claude-3.5, 0 % for GPT-4o, *p* < 0.0001 for all models) and almost perfect performance when the watermark gave a hint to the correct diagnosis (90 % for Claude-3, 90 % for Claude-3.5, 100 % for GPT-4o, *p* < 0.0001 each) (Figure 3c, Supplementary Table 11). The watermarks hereby drastically reduced the probability of prediction of the other given options, namely breast, lung and colon cancer (Figure 3d, f, Supplementary Figure 1).

![Figure 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/12/12/2024.12.11.24318840/F3.medium.gif)

[Figure 3:](http://medrxiv.org/content/early/2024/12/12/2024.12.11.24318840/F3)

Figure 3: Misleading watermarks lead to loss in accuracy of tissue-of-origin prediction
**a** Schematic illustrations of WSI. Left: A prostate cancer WSI with the misleading label of the “Ovarian cancer center”. Right: An ovarian cancer WSI with the misleading label of the “Prostate Cancer Center”**. b** Exemplary images with different watermark constellations. **c** Accuracy of predicting the tumor-tissue of origin per model and watermark constellation for *n* = 30 whole-slide-images with *n* = 15 WSI of prostate cancer and *n* = 15 WSI of ovarian cancer, with a total number of *n* = 810 observations (30 WSI * 3 repeated measures * 3 models * 3 label constellations). The watermark leaked either the correct diagnosis (for Ovarian cancer cases corresponding to the logo of the Ovarian Cancer Center, and for prostate cancer to the Prostate Cancer Center), a misleading diagnosis (corresponding to interchanged watermarks for ovary and prostate) or no watermark. *n* = 270 requests per model. Displayed as mean ± SD, circles represent observations summarized as triplicates. **d**, **f** Sankey diagrams for e, g, representing which watermark constellations led to which predicted tissue of origins for ovarian cancer (d) and prostate cancer (f), respectively. **e, g** Accuracy split up for **e** Ovarian Cancer and **g** Prostate Cancer WSI. Statistical significance tested with the Kruskal-Wallis test and Dunn’s post hoc test, each comparing the label constellations with their respective baseline w/o label. Images partially created with [www.BioRender.com](https://www.BioRender.com).

### Prompt engineering is insufficient to focus VLMs attention on tissue opposed to labels or watermarks

Finally, we investigated prompt engineering as a simple, yet often effective mitigation measure. We instructed the VLMs to “focus exclusively on the visible tissue, and disregard influences like watermarks or labels, as those could be misleading and should not influence the models decision”. We then repeated the queries of all models and all previously reported tasks, i.e. T-Stage, Lymph node infiltration, Mutation Status, Tissue of Origin Prediction, with the engineered prompt (Figure 4 a-c, Supplementary Table 12-15). Surprisingly, the “Attention on tissue” prompt did not significantly alter the accuracy of the models (*p* = 0.15). It did not improve accuracy when the model was presented with a misleading label, with 4,2 % to 4,7 % for Claude-3 (Native Prompt vs Attention on tissue, *p* = 0.8), 2,8 % to 6,5 % for Claude-3.5 (*p* = 0.06), 0 % and 0 % for GPT-4o (*p* = 1). Neither did it reduce performance when presented with a true label, with an accuracy of 96 % and 98 % for Claude-3 (Native Prompt vs Attention on tissue, *p* = 0.24), 96 % and 93 % for Claude-3.5 (*p* = 0.15), 91 % and 100 % for GPT-4o (*p* = 0.0002), suggesting that prompt engineering is insufficient to shift the VLMs attention from the labels towards the actual slide when presented with both image and label and can even have unexpected effects in opposite directions, as shown for GPT-4o.

![Figure 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/12/12/2024.12.11.24318840/F4.medium.gif)

[Figure 4:](http://medrxiv.org/content/early/2024/12/12/2024.12.11.24318840/F4)

Figure 4: Prompt engineering does not mitigate dependence on labels for VLM decision-making
**a-c** Diagnostic accuracy over all tasks (T-Stage, lymph node infiltration, Mutational status, tissue of origin detection) combined for different label constellations, with **a** for misleading label, **b** without label and **c** with ground-truth leaking label. *n* = 1944 unique observations for “Native prompt” and “Attention on tissue”, respectively. Displayed as mean ± SD, circles represent observations summarized as triplicates. “Native prompt” corresponds to original instructions (see Table 1) and data from Figures 2 and 3. “Attention on tissue” corresponds to observations repeated with engineered prompt as a combination of the native prompt and the addition “Focus exclusively on the tissue, and not on any surrounding element like watermarks, labels or other artefacts. These could be misleading and should not influence your decision". Statistical significance tested with Wilcoxon-rank-sum test for each model between “Native prompt” and “attention on tissue”.

## Discussion

Our study demonstrates that subtle clues on histopathological slides, like labels or watermarks, are highly relevant for VLMs capabilities to make diagnostic decisions from histopathological images, serving as incidental prompt injection attacks. We show that these subtle clues are a) broadly represented in real-world histopathological data, b) strongly influence the outputs of a variety of state-of-the-art VLMs, and c) cannot easily be mitigated by prompt engineering approaches. They are therefore a relevant issue both for training and inference of VLMs for tissue medicine.

As new AI methods and models are being developed at unprecedented pace24, potential dangers, limitations, and drawbacks can easily be overlooked25,26. In the context of healthcare, this can have dire consequences. One of these dangers are prompt injection attacks, in which hidden, adversarial instructions are presented to VLMs to alter their output 17,18,20,27, e.g. from correctly diagnosing a tumor to falsely reporting a healthy scan16. However, that scenario requires an active malicious actor with access to the prompt itself, which should be prevented by IT security measures adapting to new technology 28–31. In this study however, we show that clinicians themselves can serve as the “malicious” actor, inserting a prompt injection during their routine clinical activities just by scribbling on a slide, a far more realistic threat, if VLMs over-rely on those labels instead of the tissue.

We show that a substantial number of histopathological slides (around 25%) can contain pen marks, as demonstrated through representative sampling of slides from nine TCGA cohorts and five pathology centers across Germany 32. In a number of comprehensive experiments we then demonstrate that all tested VLMs are flipping their outputs completely upon presentation of labels from opposing classes, for otherwise identical images. Hereby, they are exploiting correlations likely present in their training data, a phenomenon known in machine learning as “Clever Hans behaviour”33,34. However, as VLMs are generalist models, capable of interpreting in-detail and in-context image and text input, the clues for the VLM can be extremely subtle. Our experiments show that it is sufficient to add for example a watermark, similar to what a slide scanner might do, or scribbles from pathologists. Other scenarios could involve the pre-settings of an ultrasound device, QR codes with patient information or leakage of the data center, etc 34. Virtually any additional input from which the model can derive information could cause this kind of attack, meanwhile the model plausibly suggests that the information was derived from the image or tissue itself, potentially even with convincing pathophysiological explanations 35,36.

As both training and validation data for current VLMs are likely highly contaminated by unclean data that can cause drastic over-estimation of model performance, our results highlight both data cleaning and the need for interpretability measures for classification tasks as crucial aspects for VLMs in healthcare 37. Furthermore, mitigation measures against prompt injection attacks must be further explored. Prompt engineering can drastically alter model outputs and improve alignment of VLMs. In a previous study investigating prompt injection attacks with the malicious intent of obscuring a clearly visible malignancy and falsely presenting it as a healthy scan, we observed that Claude-3.5 demonstrated improved alignment when explicitly reminded to prioritize ethical standards16,38. In contrast, in the current study, prompt engineering failed to enhance the models’ focus on the tissue. This disparity may stem from the nature of the task. In the previous study, we actively inserted text into images to enforce the opposite of what was clearly present, potentially alerting the models due to conflicting information. In the current study, however, we focused on alterations that, by themselves, represent a more natural data input and would not necessarily raise suspicions, as labels are commonly present on pathology images. Further, the current approach lacks the ethical dimension of deliberate contradiction, as all involved options were equally plausible choices. These points illustrate that mere prompt engineering is likely insufficient to mitigate the prompt injections. Effective mitigation for VLM-based decision-making tools in pathology could rather involve e.g. preprocessing with background removal strategies or active masking of labels, which could be provided e.g. by LLM-based anonymization techniques39. Overall, our study highlights prompt injection attacks as a concept that has broad implications both for the explainability for VLMs as well as for practical applications, e.g. in healthcare.

## Methods

### Ethics statement

All research procedures involving VLMs were conducted exclusively on anonymized patient data from publicly available datasets ([https://www.cancerimagingarchive.net/collection/sln-breast/](https://www.cancerimagingarchive.net/collection/sln-breast/) (PMID: 31308507) 32 and [https://portal.gdc.cancer.gov/](https://portal.gdc.cancer.gov/)) 40 and in accordance with the Declaration of Helsinki, maintaining all relevant ethical standards. The overall analysis was approved by the Ethics commission of the Medical Faculty of the Technical University Dresden (BO-EK-412102024). Our work demonstrates a significant threat for healthcare. By publicly disclosing the vulnerabilities and attacks explored in this paper, our goal is to encourage robust mitigation and defense mechanisms and promote transparency regarding risks associated with VLMs. All prompts were injected in a completely simulated scenario to prevent unintended harm. We strongly emphasize that the disclosed attack techniques and prompts should under no circumstances be used in real-world scenarios without proper authorization. None of the models used in the study are approved for any medical / diagnostic purposes.

### Experimental setup

Models (Claude 3 Opus (claude-3-opus-20240229), Claude 3.5 Sonnet (claude-3-5-sonnet-20241022), GPT-4o (gpt-4o-2024-05-13)) were accessed via API between 10th of October and 10th of November 2024. Learning features (e.g. ChatGPT’s Memory function) were deactivated. User prompts were introduced in independent API calls along with image prompts, with the temperature set to 0.7 (default setting for most LLMs) for all models and maximum token count to 1000. No individual system prompts were added.

For each patient case, three images were created, with one image serving as baseline without any label, one image with the correct label and one with a false label. Pen-labels were added to the area with maximum whitespace on the raw images, Watermarks were always added into the upper right corner with 100 % opacity and 10 % of width of the slide * 20 % of the height of the slide in size. Images were passed individually to the model with a resolution of 3-5 MB per image as some models currently have file size limits of 5 MB per image. The background was white for all scenarios. Slides for the T-Stage scenario, the mutational analysis scenario and the watermark scenario were taken from the COAD-READ, the THCA, the OV, and the PRAD cohort of the TCGA.

### Prompts

Prompts (see Table 1) were defined a priori, consisting of minimal necessary context, the instruction and an output indicator. Prompts were not iterated or fine-tuned before the start of the study.

### Accuracy

Accuracy of the models for respective tasks was assessed with a binary score of 0 or 1, for wrong or correct classification. For each experiment, the ground truth classifications were pre-determined by a licensed pathologist and served as the basis for comparison. A score of 1 was assigned for correct classifications matching the ground truth (True_Prompt), and 0 for incorrect classifications. Task-specific additional rules ensured appropriate evaluation of predictions, including ordinal or categorical comparisons (e.g., T-Stage order or mutation types). Invalid predictions (e.g., unrecognized formats or overly long strings) were flagged as NA, manually curated by a licensed physician (J.C.) or excluded from accuracy calculations.

### Pen labels

Handwritten labels were generated by using the handwriting function of MS word and were then exported as vector graphics. Labels representing either the T-Stages of the TNM classification (Tis, T1, T2, T3, T4), the lymph node status as X (infiltrated lymph nodes) / Y (all lymph nodes), or the mutational status with either wildtype (WT), BRAF mutated (BRAF) or RAS mutated (RAS) PTC were added to the whitest area of the image with a grid-based approach (3 x 3). Each label corresponded either to the ground truth or a misleading label, with a minimum difference of two elements from ground truth for T-Stage and lymph nodes.

### Watermarks

SVG-watermarks for logos of fictitious “cancer centers” resembling specific organs with explicitly added text leaking the organ information were conceptualized using DALLE, created with Inkscape and added with full opacity in the upper right corner of the slides, with dimensions < than 10 % in width and 20 % in height of the tissue slide.

### Mitigation efforts

All prompts queried for pen-label and watermark experiments were repeated with prompt engineering, concatenating the following statement to the pre-defined prompts: “Focus exclusively on the tissue, and not on any surrounding element like watermarks, labels or other artifacts. These could be misleading and should not influence your decision.” Results were then compared for all experiments (T-Stage, Lymph Node Infiltration, Mutational status, Tissue of origin) together.

### Statistical analysis

All results are shown as mean ± standard deviation (SD), and statistical significance was either assessed by the Mann-Whitney U test (independent samples) or Wilcoxon Signed-Rank test (dependent samples/within the same model) plus Bonferroni correction for multiple testing, with significance level alpha < 0.05.

### Software

Models were assessed via respective APIs using Visual Studio Code with Python Version 3.11. Graphs were created with RStudio (2024.04.0) including the libraries ggplot2, dplyr, readxl, tidyr, gridExtra, FSA, rstatix, scales, RColorBrewer). Figures were composed with Inkscape, version 1.3.2. The models GPT-4o (OpenAI) and Claude 3.5 Sonnet (Anthropic) were used for spell checking, grammar correction and programming assistance during the writing of this article, in accordance with the COPE (Committee on Publication Ethics) position statement of 13 February 202341.

## Data availability

The original data (images, prompts, model outputs, ratings, summary statistics) are available in the supplementary data and supplementary information. The following publicly accessible data was used for this study: COAD-READ, the THCA, the OV, and the PRAD cohort of the TCGA, can be found at [https://portal.gdc.cancer.gov/](https://portal.gdc.cancer.gov/). Slides for the lymph node metastasis scenario can be found at [https://www.cancerimagingarchive.net/collection/sln-breast/](https://www.cancerimagingarchive.net/collection/sln-breast/).

## Code availability

All code is available under [https://github.com/KatherLab/patho\_prompt\_injection](https://github.com/KatherLab/patho_prompt_injection)

## Author contributions

JC and SF designed the study. SF generated the (sub)image data from online resources. JC performed the experiments, evaluated and interpreted the results. JC and SF wrote the initial draft of the manuscript. ME, FL, NR, FK, MS, MJ screened pathological slides at their institutions for pen marks. SJKS screened the TCGA slide cohorts for pen marks. DF, ICW and JNK provided scientific support for running the experiments and contributed to writing the manuscript. JNK and SF supervised the study. All authors contributed scientific advice and approved the final version of the manuscript.

## Competing interests

DT received honoraria for lectures by Bayer, Philips and AstraZenica and holds shares in StratifAI GmbH, Germany and Synagen GmbH, Germany. SF has received honoraria from MSD and BMS. JNK declares consulting services for Owkin, France, DoMore Diagnostics, Norway, Panakeia, UK, Scailyte, Switzerland, Cancilico, Germany, Mindpeak, Germany, MultiplexDx, Slovakia, and Histofy, UK; furthermore he holds shares in StratifAI GmbH, Germany and in Synagen GmbH, Germany, has received a research grant by GSK, and has received honoraria by AstraZeneca, Bayer, Eisai, Janssen, MSD, BMS, Roche, Pfizer and Fresenius. ICW has received honoraria by AstraZeneca. No other competing financial interests are declared by any of the authors.

## Supplementary Information

![Supplementary Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/12/12/2024.12.11.24318840/F5.medium.gif)

[Supplementary Figure 1:](http://medrxiv.org/content/early/2024/12/12/2024.12.11.24318840/F5)

Supplementary Figure 1: Misleading watermarks lead to loss in accuracy of tissue-of-origin prediction
**a-f** Sankey diagrams for models Claude-3 (**a**, **b**), Claude-3.5 (**c**, **d**) and GPT-4o (**e**, **f**) with or without the misleading watermark, each representing the “Ovarian Cancer Center” for all prostate tissue slides and vice versa, with **a**, **c**, **e** for Ovarian cancer slides and **b**, **d**, **f** for Prostate cancer slides. The right side of Sankey diagrams represents multiple choice options provided to the VLMs. *n* = 45 unique observations per model and watermark category.

## Acknowledgements

JC is supported by the Mildred-Scheel-Postdoktorandenprogramm of the German Cancer Aid (grant #70115730). CVS is supported by a grant from the Interdisciplinary Centre for Clinical Research within the faculty of Medicine at the RWTH Aachen University (PTD 1-13/IA 532313), the Junior Principal Investigator Fellowship program of RWTH Aachen Excellence strategy, the NRW Rueckkehr Programme of the Ministry of Culture and Science of the German State of North Rhine-Westphalia and by the CRC 1382 (ID 403224013) funded by Deutsche Forschungsgesellschaft (DFG, German Research Foundation). SF is supported by the German Federal Ministry of Education and Research (SWAG, 01KD2215C), the German Cancer Aid (DECADE, 70115166 and TargHet, 70115995) and the German Research Foundation (504101714). MJ is supported by the German Cancer Aid (TargHet, 70115995). FK is funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) via the "Clinician Scientist Program in Evolutionary Medicine" (413490537). DT is funded by the German Federal Ministry of Education and Research (TRANSFORM LIVER, 031L0312A), the European Union’s Horizon Europe and innovation programme (ODELIA, 101057091), and the German Federal Ministry of Health (SWAG, 01KD2215B). JNK is supported by the German Cancer Aid (DECADE, 70115166), the German Federal Ministry of Education and Research (PEARL, 01KD2104C; CAMINO, 01EO2101; SWAG, 01KD2215A; TRANSFORM LIVER, 031L0312A; TANGERINE, 01KT2302 through ERA-NET Transcan; Come2Data, 16DKZ2044A; DEEP-HCC, 031L0315A), the German Academic Exchange Service (SECAI, 57616814), the German Federal Joint Committee (TransplantKI, 01VSF21048) the European Union’s Horizon Europe and innovation programme (ODELIA, 101057091; GENIAL, 101096312), the European Research Council (ERC; NADIR, 101114631), the National Institutes of Health (EPICO, R01 CA263318) and the National Institute for Health and Care Research (NIHR, NIHR203331) Leeds Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. This work was funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

*   Received December 11, 2024.
*   Revision received December 11, 2024.
*   Accepted December 12, 2024.


*   © 2024, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## References

1.  1.Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172– 180 (2023).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-023-06291-2&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37438534&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

2.  2.Bubeck, S., et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv [cs.CL] (2023).
    
    
3.  3.Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Med. 3, 141 (2023).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37816837&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

4.  4.Ferber Dyke, et al. GPT-4 for Information Retrieval and Comparison of Medical Oncology Guidelines. NEJM AI 1, AIcs2300235 (2024).
    
    
5.  5.Wiest, I. C. et al. Privacy-preserving large language models for structured medical information retrieval. NPJ Digit. Med. 7, 257 (2024).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=39304709&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

6.  6.Bedi, S. et al. Testing and evaluation of health care applications of large language models: A systematic review: A systematic review. JAMA (2024) doi:10.1001/jama.2024.21700.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2024.21700&link_type=DOI) 

7.  7.Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41591-024-02855-5&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=38413730&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

8.  8.Lu, M. Y. et al. A Multimodal Generative AI Copilot for Human Pathology. Nature (2024) doi:10.1038/s41586-024-07618-3.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-024-07618-3&link_type=DOI) 

9.  9.Christensen, M., Vukadinovic, M., Yuan, N. & Ouyang, D. Vision-language foundation model for echocardiogram interpretation. Nat. Med. 30, 1481–1488 (2024).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=38689062&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

10. 10.Zhang, K. et al. A generalist vision-language foundation model for diverse biomedical tasks. Nat. Med. 1–13 (2024).
    
    
11. 11.Ferber, D., et al. End-to-end clinical trial matching with Large Language Models. arXiv [cs.CL] (2024).
    
    
12. 12.Ferber, D., et al. Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology. arXiv [cs.AI] (2024).
    
    
13. 13.Truhn, D., Eckardt, J.-N., Ferber, D. & Kather, J. N. Large language models and multimodal foundation models for precision oncology. NPJ Precis. Oncol. 8, 72 (2024).
    
    
14. 14.Gilbert, S., Harvey, H., Melvin, T., Vollebregt, E. & Wicks, P. Large language model AI chatbots require approval as medical devices. Nat. Med. (2023) doi:10.1038/s41591-023-02412-6.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41591-023-02412-6&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37391665&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

15. 15.Han, T., Kumar, A., Agarwal, C. & Lakkaraju, H. Towards Safe Large Language Models for Medicine. arXiv [cs.AI] (2024).
    
    
16. 16.Prompt Injection Attacks onLarge Language Models in Oncology. [https://arxiv.org/html/2407.18981v1](https://arxiv.org/html/2407.18981v1).
    
    
17. 17.Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems. [https://arxiv.org/html/2410.07283v1](https://arxiv.org/html/2410.07283v1).
    
    
18. 18.Kimura, S., Tanaka, R., Miyawaki, S., Suzuki, J. & Sakaguchi, K. Empirical analysis of large vision-language models against goal hijacking via visual prompt injection. arXiv [cs.CL] (2024).
    
    
19. 19.Debenedetti, E. et al. AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents. arXiv [cs.CR] (2024).
    
    
20. 20.Samoilenko, R. Prompt injection attack on ChatGPT steals chat data. System Weakness [https://systemweakness.com/new-prompt-injection-attack-on-chatgpt-web-version-ef717492c5c2](https://systemweakness.com/new-prompt-injection-attack-on-chatgpt-web-version-ef717492c5c2) (2023).
    
    
21. 21.Hubinger, E. et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv [cs.CR] (2024).
    
    
22. 22.Mehrotra, A., et al. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. arXiv [cs.LG] (2023).
    
    
23. 23.Jiang, F., et al. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. arXiv [cs.CL] (2024).
    
    
24. 24.Perez-Lopez, R., Ghaffari Laleh, N., Mahmood, F. & Kather, J. N. A guide to artificial intelligence for cancer researchers. Nat. Rev. Cancer 24, 427–441 (2024).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=38755439&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

25. 25.Han, T. et al. Medical large language models are susceptible to targeted misinformation attacks. NPJ Digit. Med. 7, 288 (2024).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=39443664&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

26. 26.Han, T. et al. Advancing diagnostic performance and clinical usability of neural networks via adversarial training and dual batch normalization. Nat. Commun. 12, 4315 (2021).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=34262044&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

27. 27.Liu, Y., et al. Prompt Injection attack against LLM-integrated Applications. arXiv [cs.CR] (2023).
    
    
28. 28.Jalali, M. S., Bruckes, M., Westmattelmann, D. & Schewe, G. Why employees (still) click on phishing links: Investigation in hospitals. J. Med. Internet Res. 22, e16775 (2020).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32012071&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

29. 29.Gordon, W. J. et al. Assessment of employee susceptibility to phishing attacks at US health care institutions. JAMA Netw. Open 2, e190393 (2019).
    
    
30. 30.Cartwright, A. J. The elephant in the room: cybersecurity in healthcare. J. Clin. Monit. Comput. 37, 1123–1132 (2023).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37088852&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

31. 31.Sallam, M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare (Basel) 11, (2023).
    
    
32. 32.Campanella, G., Hanna, M. G., Brogi, E. & Fuchs, T. J. Breast metastases to axillary lymph nodes. The Cancer Imaging Archive doi:10.7937/TCIA.2019.3XBN2JCC (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.7937/TCIA.2019.3XBN2JCC&link_type=DOI) 

33. 33.Lapuschkin, S. et al. Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun. 10, 1096 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41467-019-08987-4&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30858366&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F12%2F12%2F2024.12.11.24318840.atom) 

34. 34.Dehkharghanian, T. et al. Biased data, biased AI: deep networks predict the acquisition site of TCGA images. Diagn. Pathol. 18, 67 (2023).
    
    
35. 35.Turpin, M., Michael, J., Perez, E. & Bowman, S. R. Language Models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. arXiv [cs.CL] (2023).
    
    
36. 36.Agarwal, C., Tanneru, S. H. & Lakkaraju, H. Faithfulness vs. Plausibility: On the (Un)reliability of explanations from large Language Models. arXiv [cs.CL] (2024).
    
    
37. 37.Wang, Z., et al. Label Cleaning Multiple Instance Learning: Refining coarse annotations on single whole-slide images. arXiv [cs.CV] (2021).
    
    
38. 38.Research. [https://www.anthropic.com/research#alignment](https://www.anthropic.com/research#alignment).
    
    
39. 39.Wiest, I. C. et al. Anonymizing medical documents with local, privacy preserving large language models: The LLM-Anonymizer. medRxiv 2024.06.11.24308355 (2024) doi:10.1101/2024.06.11.24308355.
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoibWVkcnhpdiI7czo1OiJyZXNpZCI7czoyMToiMjAyNC4wNi4xMS4yNDMwODM1NXYxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjQvMTIvMTIvMjAyNC4xMi4xMS4yNDMxODg0MC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 

40. 40.GDC Data Portal Homepage. [https://portal.gdc.cancer.gov/](https://portal.gdc.cancer.gov/).
    
    
41. 41.Authorship and AI tools. COPE: Committee on Publication Ethics [https://publicationethics.org/cope-position-statements/ai-author](https://publicationethics.org/cope-position-statements/ai-author).