research-article
Open access

Hurtful words: quantifying biases in clinical contextual word embeddings

Published: 02 April 2020

Abstract

In this work, we examine the extent to which embeddings may encode marginalized populations differently, and how this may lead to a perpetuation of biases and worsened performance on clinical tasks. We pretrain deep embedding models (BERT) on medical notes from the MIMIC-III hospital dataset, and quantify potential disparities using two approaches. First, we identify dangerous latent relationships that are captured by the contextual word embeddings using a fill-in-the-blank method with text from real clinical notes and a log probability bias score quantification. Second, we evaluate performance gaps across different definitions of fairness on over 50 downstream clinical prediction tasks that include detection of acute and chronic conditions. We find that classifiers trained on BERT representations exhibit statistically significant differences in performance, often favoring the majority group with regard to gender, language, ethnicity, and insurance status. Finally, we explore shortcomings of using adversarial debiasing to obfuscate subgroup information in contextual word embeddings, and recommend best practices for such deep embedding models in clinical settings.
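The fill-in-the-blank idea can be illustrated with a small sketch. The abstract does not spell out the formula, so the version below assumes one common formulation of a log probability bias score: the log ratio between the probability a masked language model assigns to a demographic word when the rest of the sentence is visible and its probability when the attribute words are also masked (the prior). The function names and all probabilities are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def log_prob_bias_score(p_target: float, p_prior: float) -> float:
    """Log ratio of the target-word probability with the attribute visible
    to its probability with the attribute masked out (the prior)."""
    if p_target <= 0 or p_prior <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_target / p_prior)


def bias_gap(group_a, group_b) -> float:
    """Difference between the scores of two groups for the same template.
    Each argument is a (p_target, p_prior) pair. A positive value means the
    attribute raises the model's preference for group A more than group B."""
    return log_prob_bias_score(*group_a) - log_prob_bias_score(*group_b)


# Made-up masked-LM probabilities for one hypothetical template in which a
# demographic word is blanked out. These numbers are NOT from the paper.
group_a = (0.30, 0.20)  # (p_target, p_prior)
group_b = (0.10, 0.20)

gap = bias_gap(group_a, group_b)
print(f"{gap:.3f}")  # log(1.5) - log(0.5) = log(3), prints 1.099
```

In practice the two probabilities would come from a masked language model evaluated on many templates, and the per-template gaps would then be tested for statistical significance across the whole set.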

      Published In

      CHIL '20: Proceedings of the ACM Conference on Health, Inference, and Learning
      April 2020
      265 pages
      ISBN:9781450370462
      DOI:10.1145/3368555
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. algorithmic fairness
      2. clinical notes
      3. contextual language models
      4. machine learning for health
      5. natural language processing

      Qualifiers

      • Research-article

      Conference

      ACM CHIL '20

      Acceptance Rates

      Overall Acceptance Rate 27 of 110 submissions, 25%

      Cited By

      • (2025) Examining inclusivity: the use of AI and diverse populations in health and social care: a systematic review. BMC Medical Informatics and Decision Making 25:1. DOI: 10.1186/s12911-025-02884-1. Online publication date: 5-Feb-2025
      • (2025) MLHOps: Machine Learning Health Operations. IEEE Access 13 (20374-20412). DOI: 10.1109/ACCESS.2024.3521279. Online publication date: 2025
      • (2025) Stars, Stripes, and Silicon: Unravelling the ChatGPT’s All-American, Monochrome, Cis-centric Bias. Machine Learning and Principles and Practice of Knowledge Discovery in Databases (283-292). DOI: 10.1007/978-3-031-74630-7_19. Online publication date: 8-Feb-2025
      • (2024) Unmasking societal biases in respiratory support for ICU patients through social determinants of health. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (7421-7429). DOI: 10.24963/ijcai.2024/821. Online publication date: 3-Aug-2024
      • (2024) Leveraging Temporal Trends for Training Contextual Word Embeddings to Address Bias in Biomedical Applications: Development Study. JMIR AI 3 (e49546). DOI: 10.2196/49546. Online publication date: 2-Oct-2024
      • (2024) Natural Language Processing for Radiation Oncology: Personalizing Treatment Pathways. Pharmacogenomics and Personalized Medicine, Volume 17 (65-76). DOI: 10.2147/PGPM.S396971. Online publication date: Feb-2024
      • (2024) 12. Artificial Intelligence and Machine Learning in Research on Minority Health and Health Disparities. Race and Research: Perspectives on Minority Participation in Health Studies, 2nd ed. DOI: 10.2105/9780875533476ch12. Online publication date: 11-Jun-2024
      • (2024) Peer review of GPT-4 technical report and systems card. PLOS Digital Health 3:1 (e0000417). DOI: 10.1371/journal.pdig.0000417. Online publication date: 18-Jan-2024
      • (2024) Understanding and training for the impact of large language models and artificial intelligence in healthcare practice: a narrative review. BMC Medical Education 24:1. DOI: 10.1186/s12909-024-06048-z. Online publication date: 7-Oct-2024
      • (2024) Auditing GPT's Content Moderation Guardrails: Can ChatGPT Write Your Favorite TV Show? Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (660-686). DOI: 10.1145/3630106.3658932. Online publication date: 3-Jun-2024
