RT Journal Article
SR Electronic
T1 OQA : A question-answering dataset on orthodontic literature
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2024.07.05.24309412
DO 10.1101/2024.07.05.24309412
A1 Rousseau, Maxime
A1 Zouaq, Amal
A1 Huynh, Nelly
YR 2024
UL http://medrxiv.org/content/early/2024/07/08/2024.07.05.24309412.abstract
AB Background: The near-exponential increase in the number of publications in orthodontics poses a challenge for efficient literature appraisal and evidence-based practice. Language models (LMs) have the potential, through question-answering fine-tuning, to assist clinicians and researchers in the critical appraisal of scientific information and thus to improve decision-making. Methods: This paper introduces OrthodonticQA (OQA), the first question-answering dataset in the field of dentistry that is made publicly available under a permissive license. A framework is proposed that uses PICO information and templates for question formulation, demonstrating their broader applicability across various specialties within dentistry and healthcare. A selection of transformer LMs was trained on OQA to set performance baselines. Results: The best model achieved a mean F1 score of 77.61 (SD 0.26) and a score of 100/114 (87.72%) on human evaluation. Furthermore, when performance was explored according to grouped subtopics within the field of orthodontics, it was found that the performance of all LMs can vary considerably across topics. Conclusion: Our findings highlight the importance of subtopic evaluation and the superior performance of a paired domain-specific model and tokenizer. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: The main author would like to acknowledge the financial support provided by the University of Montreal Faculty of Dentistry to conduct this research.
This research was enabled in part by support provided by Calcul Québec (https://www.calculquebec.ca/) and the Digital Research Alliance of Canada (alliancecan.ca). Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes. All data produced are available online at https://huggingface.co/datasets/m-rousseau/oqa-v1 (dataset repository) and https://github.com/maxrousseau/o-nlp (source code repository). Both the OQA dataset and the code to reproduce our experiments are made available under the open-source Apache-2.0 license.