PT  - JOURNAL ARTICLE
AU  - Awasthi, Raghav
AU  - Mishra, Shreya
AU  - Mahapatra, Dwarikanath
AU  - Khanna, Ashish
AU  - Maheshwari, Kamal
AU  - Cywinski, Jacek
AU  - Papay, Frank
AU  - Mathur, Piyush
TI  - HumanELY: Human evaluation of LLM yield, using a novel web-based evaluation tool
AID  - 10.1101/2023.12.22.23300458
DP  - 2023 Jan 01
TA  - medRxiv
PG  - 2023.12.22.23300458
4099  - http://medrxiv.org/content/early/2023/12/30/2023.12.22.23300458.short
4100  - http://medrxiv.org/content/early/2023/12/30/2023.12.22.23300458.full
AB  - Large language models (LLMs) have caught the imagination of researchers,developers and public in general the world over with their potential for transformation. Vast amounts of research and development resources are being provided to implement these models in all facets of life. Trained using billions of parameters, various measures of their accuracy and performance have been proposed and used in recent times. While many of the automated natural language assessment parameters measure LLM output performance for use of language, contextual outputs are still hard to measure and quantify. Hence, human evaluation is still an important measure of LLM performance,even though it has been applied variably and inconsistently due to lack of guidance and resource limitations.To provide a structured way to perform comprehensive human evaluation of LLM output, we propose the first guidance and tool called HumanELY. Our approach and tool built using prior knowledge helps perform evaluation of LLM outputs in a comprehensive, consistent, measurable and comparable manner. HumanELY comprises of five key evaluation metrics: relevance, coverage, coherence, harm and comparison. Additional submetrics within these five key metrics provide for Likert scale based human evaluation of LLM outputs. Our related webtool uses this HumanELY guidance to enable LLM evaluation and provide data for comparison against different users performing human evaluation. While all metrics may not be relevant and pertinent to all outputs, it is important to assess and address their use.Lastly, we demonstrate comparison of metrics used in HumanELY against some of the recent publications in the healthcare domain. We focused on the healthcare domain due to the need to demonstrate highest levels of accuracy and lowest levels of harm in a comprehensive manner. We anticipate our guidance and tool to be used for any domain where LLMs find an use case.Link to the HumanELY Tool https://www.brainxai.com/humanelyCompeting Interest StatementThe authors have declared no competing interest.Funding StatementNAAuthor DeclarationsI confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.YesI confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.YesI understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).YesI have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.YesWe have not performed any data analysis in this paper.