RT Journal Article
SR Electronic
T1 SPIRIT-CONSORT-TM: a corpus for assessing transparency of clinical trial protocol and results publications
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2025.01.14.25320543
DO 10.1101/2025.01.14.25320543
A1 Jiang, Lan
A1 Vorland, Colby J
A1 Ying, Xiangji
A1 Brown, Andrew W
A1 Menke, Joe D
A1 Hong, Gibong
A1 Lan, Mengfei
A1 Mayo-Wilson, Evan
A1 Kilicoglu, Halil
YR 2025
UL http://medrxiv.org/content/early/2025/01/15/2025.01.14.25320543.abstract
AB Randomized controlled trials (RCTs) can produce valid estimates of the benefits and harms of therapeutic interventions. However, incomplete reporting can undermine the validity of their conclusions. Reporting guidelines, such as SPIRIT for protocols and CONSORT for results, have been developed to improve transparency in RCT publications. In this study, we report a corpus of 200 RCT publications, named SPIRIT-CONSORT-TM, annotated for transparency. We used a comprehensive data model that includes 83 items from SPIRIT and CONSORT checklists for annotation. Inter-annotator agreement was calculated for 30 pairs. The dataset includes 26,613 sentences annotated with checklist items and 4,231 terms. We also trained natural language processing (NLP) models that automatically identify these items in publications. The sentence classification model achieved 0.742 micro-F1 score (0.865 at the article level). The term extraction model yielded 0.545 and 0.663 micro-F1 score in strict and lenient evaluation, respectively. The corpus serves as a benchmark to train models that assist stakeholders of clinical research in maintaining high reporting standards and synthesizing information on study rigor and conduct. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: This work was supported by the National Library of Medicine of the National Institutes of Health under the award number R01LM014079.
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funder had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication. This work used Bridges-2 and Ocean at Pittsburgh Supercomputing Center (PSC) through allocation CIS230380 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services Support (ACCESS) program [45], which is supported by National Science Foundation, United States grants #2138259, #2138286, #2138307, #2137603, and #2138296. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes. Code, data, and materials related to the searching and processing of PubMed search results and screening of articles are available from https://osf.io/8rg4h/.
Code used for training and evaluating the models is available at https://github.com/ScienceNLP-Lab/RCT-Transparency/tree/main/SPIRIT-CONSORT-TM.