Abstract
Background As a consequence of sports-related concussions, female athletes have been documented as reporting more symptoms than their male counterparts, in addition to incurring longer periods of recovery. However, the role of gender and its potential influence on symptom reporting and recovery outcomes in concussion management has not been completely explored.
Study Design This study investigates potential differential item functioning (DIF) related to gender biases within the SCAT3 symptom severity checklist. Data were obtained from the Federal Interagency of Traumatic Brain Injury Research (FITBIR), which included information from the 2014-2017 NCAA and DoD CARE Consortium. SCAT3 Symptom Severity sub-scores from 1,258 NCAA athletes (n=473 females and n=785 males) were analyzed across five time points post-concussion: less than six hours post-injury, 24-48 hours post-injury, asymptomatic, unrestricted return to play, and at 6 months.
Results During the recovery phase, female athletes experienced more headaches, pressure in the head, and fatigue than male athletes. Overall, both male and female athletes had equivalent knowledge of concussions, and there was no significant difference in symptom-reporting ability for most items, including emotion-related symptoms. Group-level differences were detected only during the unrestricted return to play phase, with females being more likely to report more severe symptoms than males. However, further analysis showed that females exhibit a relatively high level of difficulty reporting symptom severity beyond ‘Mild’; the group-level DIF may therefore result from gender biases within the checklist.
Conclusion The present analysis posits that Differential Item Functioning (DIF) in specific symptoms may lead to gender bias. The findings of this study reveal that female athletes tend to exhibit symptomatic behavior upon returning to play, a phenomenon consistent with prior research. However, the possible DIF may provoke biases due to unreliable reporting measures within the SCAT3 symptom severity checklist, which may also help explain why recent literature reports that female athletes do not present as symptomatic upon return to play. Additional research is warranted to determine whether females genuinely experience more symptoms or whether the presence of these potential assessment gender biases obstructs the manifestation of asymptomatic recoveries.
Introduction
Each year, over 460,000 student-athletes participate in collegiate sports competitions organized by the National Collegiate Athletic Association (NCAA) (Zuckerman et al., 2014). During the school years of 2009 to 2010 and 2013 to 2014, on average, 4.47 per 10,000 athletes experienced concussion exposures, which computes to 10,560 concussions annually, with women’s soccer ranking second highest among sports and rates in women’s soccer and volleyball increasing the most (Chandran et al., 2022). To manage and diagnose a sports-related concussion, the NCAA uses a unisex battery of assessments, one of which is symptomatic presentation (Kroshus et al., 2021; Meehan et al., 2013; Dick, 2009; McCrory et al., 2012; Harmon et al., 2013). However, several research studies have revealed that males and females exhibit different symptomatic patterns, both pre- and post-concussion, with females experiencing more reported concussions and prolonged recoveries. These inconsistent findings have led to a lack of clarity on the influence of gender on concussion management and recovery length (Broshek et al., 2005; Brown et al., 2015; Cantu & Register-Mihalik, 2011; Randolph et al., 2009).
If, as recent data suggest, there is a gender difference in sport-related concussion symptom reporting, it remains unclear why this difference arises. This is particularly consequential because symptom reporting plays a significant role in return-to-play decisions. A systematic review further examined gender differences in symptom reporting (D’Lauro et al., 2022), revealing that while it is unclear why women reported more symptoms, this possibly factored into why it took females a day longer to recover than male athletes. Similarly, one of the largest female-only analyses investigated this phenomenon within the NCAA and DoD CARE Consortium, revealing that asymptomatic presentation throughout recovery may not be a feasible expectation for female athletes, possibly due to limitations within current concussion management practices (Brown et al., 2015). Further investigation has indicated that clinical concussion assessments may be subject to sampling bias, as current assessments have been constructed around studies involving mostly male participants (Gessel et al., 2007; Granito, 2002; Kieffer et al., 2021; Mihalik et al., 2009; O’Connor et al., 2017; Schatz et al., 2011; Wallace et al., 2017; Zuckerman et al., 2014). Therefore, generalizing head-injury and concussion assessment tools developed in males to the female athlete population is likely to produce imprecise results.
Among the most widely used assessments is the Sport Concussion Assessment Tool (SCAT) (Echemendia et al., 2017), which is used to evaluate concussions in sideline, clinic, and hospital settings (McCrory et al., 2009, 2013; Pallant & Tennant, 2007). The SCAT combines several assessments, including a 22-item graded symptom checklist, the Standard Assessment for Concussion (SAC), and a modified Balance Error Scoring System (mBESS) (Echemendia et al., 2017; Garcia et al., 2020). These components, initially chosen through consensus based on clinical experience and existing evidence, represent a robust combination of tests sensitive to concussion (Guskiewicz et al., 2013). Despite revisions to the SCAT, which is now in its sixth iteration, the symptom checklist has not been altered to account for gender since the SCAT was created during the Second International Conference on Concussion in Sport in 2004 (Yengo-Kahn et al., 2016; Chin et al., 2016).
Current concussion assessments and policies, to our knowledge, have not made any adjustments for how to manage gender in clinical settings. To address this issue, Item Response Theory (IRT) has emerged as a helpful approach, particularly with self-reported measures such as symptom reporting. IRT has recently gained popularity in abbreviating scales for concussion assessment and has been used to abbreviate various health-related and concussion outcomes while maintaining validity (Hamel et al., 2016; Heck et al., 2023; Langer et al., 2008). IRT can quantify gender-specific, shortened versions of multifaceted concussion assessments that are nearly as informative as longer ones (Angoff, 1981). In turn, its ability to identify items that function differently across subgroups makes it a useful method for investigating gender differences in symptom reporting (Bollmann et al., 2018). IRT analysis is employed when a set of questionnaire items (or items from an administered scale) is intended to be summed to provide a total score, which may include several subscale totals and an overall score. It can also be applied in the development of a new scale, where it is possible to design the item set to fit the model expectations from the outset.
Differential Item Functioning (DIF). DIF is particularly useful in identifying gender differences in symptom reporting (Kroshus et al., 2021; Langer et al., 2008; Tennant & Conaghan, 2007), as DIF can extract nuanced gender differences in endorsing a given symptom after controlling for the overall scale score (Langer et al., 2008). The use of DIF has become an integral part of determining the validity and reliability of standardized tests. DIF is not established simply because individuals from different groups have varying probabilities of responding in a certain way (Angoff, 1981; Langer et al., 2008; the EORTC Quality of Life Group and the Quality of Life Cross-Cultural Meta-Analysis Group et al., 2010). Rather, DIF becomes apparent when individuals from different groups who possess the same true ability level have varying probabilities of responding in a certain way. For example, if boys in a symptom test display a higher probability of reporting symptoms than girls of equal ability level because the contents of the test items are biased against girls, those items exhibit DIF. The assumptions of this model include unidimensionality and local independence (Ajeigbe & Afolabi, 2014). Unidimensionality occurs when each of the items in a test measures a single trait; for example, whether all 22 symptoms on the SCAT3 symptom checklist measure the latent trait of severity recognition ability. In principle, unidimensionality assumes local independence. Local independence is achieved when the probabilities of responding to the items are independent of one another, meaning that a response to one symptom is not influenced by any other symptom in the test.
The current study aims to investigate the SCAT3 symptom scores at various points during an athlete’s recovery process to identify any potential DIF at the item and group level that could be related to gender bias. Exploring potential DIF at the item and group level can reveal gender differences in symptom severity scoring that could impact the assessment tool’s validity. Ultimately, this study aims to provide insights into the potential impact of gender bias on the SCAT3 Symptom Severity Checklist in order to advocate for improving the accuracy and fairness of concussion assessments for female athletes.
Methods
Data from the NCAA and DoD CARE Consortium were obtained from the Federal Interagency of Traumatic Brain Injury Research (FITBIR). From 2014-2017, data were collected from 35,000 student-athletes and service cadets across 26 different institutions, 4 US service academies, and over 15 different sports. Within the CARE Consortium (CARE Consortium Investigators et al., 2017), approximately 13,009 military cadets and 21,006 student-athletes from NCAA Division 1-Division 3 completed baseline preseason testing, with 8,356 female athletes included (45.1%). All athletes included in this analysis met the following criteria: (i) they had a diagnosed sports-related concussion, and (ii) they had data at all five desired time points. The five time points examined were: (i) less than six hours post-injury, (ii) 24-48 hours post-injury, (iii) asymptomatic, (iv) unrestricted return to play, and (v) at 6 months. Consequently, N=1,258 NCAA athletes met the inclusion criteria, comprising female (n=473) and male (n=785) subjects. The specific sports included in this assessment are known for high incidences of concussion, including men’s football and men’s and women’s ice hockey, lacrosse, soccer, and rugby.
Outcome Measures
The SCAT3 is a neurocognitive and symptom severity checklist based on 7-point Likert scoring (0-No symptoms, 1-2-Mild, 3-4-Moderate, 5-6-Severe) (McCrory et al., 2009, 2013; Pallant & Tennant, 2007). For each item, a higher score indicates a more severe symptom (Echemendia et al., 2017). The SCAT3 was created to improve on the original SCAT and to assist in deciding when an athlete can safely return to play. The 22 items included were: Headache, Pressure in the Head, Neck Pain, Nausea/Vomiting, Dizziness, Blurry Vision, Balance Problem, Sensitivity to Light, Sensitivity to Noise, Feeling Slowed Down, Feel in a Fog, Don’t Feel Right, Difficulty Concentrating, Difficulty Remembering, Fatigue/Low Energy, Confusion, Drowsiness, Trouble Falling Asleep, More Emotional, Irritable, Sadness, and Anxious. The scores for each symptom are summed to create a total severity score out of 132, alongside a score for the total number of symptoms exhibited out of 22 (Begasse de Dhaem et al., 2017). Item response scales were recoded to compress the number of categories, as seen in Table 1, which reports assessment scoring scales, recoding measures, and corresponding adjusted response categories. The justification for compressing the response categories is to merge categories that carry the same meaning despite different numbers (e.g., Mild = 1 and 2).
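The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's processing code: the function names and the `RECODE` mapping are hypothetical, and the four-category collapse (none, mild, moderate, severe) is only an assumption inferred from the Likert anchors, since the actual recoding is reported in Table 1.

```python
# Item order follows the 22 symptoms listed in the text.
SCAT3_ITEMS = [
    "Headache", "Pressure in the Head", "Neck Pain", "Nausea/Vomiting",
    "Dizziness", "Blurry Vision", "Balance Problem", "Sensitivity to Light",
    "Sensitivity to Noise", "Feeling Slowed Down", "Feel in a Fog",
    "Don't Feel Right", "Difficulty Concentrating", "Difficulty Remembering",
    "Fatigue/Low Energy", "Confusion", "Drowsiness", "Trouble Falling Asleep",
    "More Emotional", "Irritable", "Sadness", "Anxious",
]

# ASSUMPTION: illustrative collapse of the 0-6 ratings into four categories
# (0 = none, 1 = mild, 2 = moderate, 3 = severe). Table 1 holds the real recoding.
RECODE = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}


def score_checklist(ratings):
    """Return (severity score out of 132, number of symptoms out of 22)."""
    if len(ratings) != len(SCAT3_ITEMS):
        raise ValueError("expected one rating per SCAT3 item")
    if any(r not in RECODE for r in ratings):
        raise ValueError("ratings must be integers from 0 to 6")
    severity = sum(ratings)                          # maximum 22 * 6 = 132
    n_symptoms = sum(1 for r in ratings if r > 0)    # maximum 22
    return severity, n_symptoms


def recode(ratings):
    """Collapse raw 0-6 ratings into the assumed four response categories."""
    return [RECODE[r] for r in ratings]
```

For a hypothetical athlete reporting Headache = 3 and Pressure in the Head = 2 with every other item at 0, `score_checklist` gives a severity of 5 across 2 symptoms, and `recode` maps those two ratings to the moderate and mild categories.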
Statistical Analysis
This analysis used a model-based likelihood ratio test to identify DIF in a Partial Credit Model (PCM) through Winsteps Rasch Analysis and Methods Computer Software (Linacre, 2023). The PCM was chosen because it is flexible and well suited to Likert-type data such as the SCAT3 Symptom Severity Checklist. This item response modeling-based approach assumes the null hypothesis that the parameters for a particular item do not differ between groups. An advantage of this method is that it isolates the parameters of an item: one model is fitted with those parameters free to vary between groups, another with the parameters constrained to be equal between groups, and the test statistic is the difference between the log-likelihood values of the two models multiplied by −2 (Bollmann et al., 2018). The Winsteps program anchors the mean item location, or difficulty (b), to zero. Person ability, or trait level, scores are free to vary and are interpreted in reference to the item scaling. Positive trait scores signify relatively higher performance, while negative trait scores indicate relatively lower performance, in reference to the items.
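Written out, the likelihood ratio statistic described above takes the following form. The symbols are our own notation rather than the paper's: $L_{\text{constrained}}$ is the likelihood of the model with the studied item's parameters held equal across genders, and $L_{\text{free}}$ is the likelihood of the model in which they vary by gender.

```latex
G^2 = -2\left[\ln L_{\text{constrained}} - \ln L_{\text{free}}\right]
```

Under the null hypothesis of no DIF, $G^2$ is conventionally referred to a chi-square distribution with degrees of freedom equal to the number of item parameters that were freed. A large $G^2$ means that letting the item's parameters differ by gender improves fit more than chance would explain.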
Item difficulty difference logits (d) were computed to assess the impact of DIF within the SCAT3 symptom checklist and whether it can be related to gender bias. Two estimates were used. The first is the widely used Mantel-Haenszel method (Dorans & Holland, 1992), a log-odds estimator of DIF size and significance from crosstabs of observations of the two genders. The second estimates DIF as a logit difference, or logistic regression method, which estimates the difference between the symptom difficulties for the two groups, holding everything else constant. To prevent alpha inflation and Type I error brought about by multiple comparisons, the Benjamini-Hochberg (B-H) procedure was applied post hoc with the alpha level set to α = 0.05 (Benjamini & Hochberg, 1995). The remaining comparison thresholds are adjusted according to the rank order of the observed p-values, all falling within the range set by the original significance level for DIF, α < 0.05.
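For readers unfamiliar with the Benjamini-Hochberg step-up procedure, the following Python sketch shows the rank-based thresholding it applies. It is a generic textbook implementation, not the authors' code, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff, even if its own
    # p-value missed its individual threshold.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected
```

With 22 items tested at one time point, the smallest p-value is compared against 0.05/22 and the largest against 0.05, which is why an item with an unadjusted p-value just under 0.05 does not necessarily survive the correction.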
DIF at the group level, or DGF, was computed for the interactions between classification groups of persons. The DGF contrast reports the difference in item difficulty between the two genders. The null hypothesis is that the two estimates are identical except for measurement error. DGF on an item was considered statistically significant at p ≤ 0.05. Therefore, a significant and positive DGF contrast indicates a group-level difference suggesting that the item is more difficult for female athletes.
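A contrast of this kind is commonly tested by dividing the difference between the two group estimates by their joint standard error. The Python sketch below illustrates that calculation under stated assumptions: the function name and example values are hypothetical, and the sign convention (female minus male, so that a positive contrast means harder for females) is inferred from the description above rather than taken from the software output.

```python
import math


def dgf_contrast(b_female, se_female, b_male, se_male):
    """Return (contrast, joint standard error, t-like statistic).

    ASSUMPTION: contrast = female estimate minus male estimate, in logits.
    """
    contrast = b_female - b_male
    # Standard errors of independent estimates combine in quadrature.
    joint_se = math.sqrt(se_female ** 2 + se_male ** 2)
    return contrast, joint_se, contrast / joint_se


# Hypothetical difficulty estimates, for illustration only.
contrast, joint_se, t_stat = dgf_contrast(0.5, 0.3, 0.1, 0.4)
```

The statistic grows with the size of the contrast and shrinks as either group's estimate becomes less precise, so sparse response categories in one gender make a given contrast harder to distinguish from measurement error.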
A Partial Credit Model (PCM; Masters, 1982) was used to estimate item category thresholds and corresponding expected posterior reliability at each time point. A PCM estimates delta parameters (δij), which are thresholds that represent category difficulties, and an item difficulty parameter (b) (Hays et al., 2000). Item difficulty (b) is computed by taking the average of all of the deltas (δij). An item reduction analysis was not performed, nor was it the focus of this analysis. The PCM was important to utilize as it demonstrated the current quality of the assessment without abbreviation; in turn, the PCM analysis was used where DIF at the group level occurred, to help determine whether each gender could accurately report symptom severity based on symptom difficulty. Posterior reliability, measured by a Pearson correlation coefficient (r) (Akoglu, 2018), was also computed for both genders at all five time points.
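To make the role of the thresholds concrete, the Python sketch below computes PCM category probabilities for one item from a person's trait level and the item's deltas, together with the item difficulty as the mean of those deltas. It follows the standard PCM formulation; the function names and threshold values are hypothetical and are not estimates from this study.

```python
import math


def pcm_probabilities(theta, deltas):
    """Probability of each response category 0..m for one item under the PCM.

    theta  : person trait level in logits
    deltas : thresholds delta_1..delta_m for the item, in logits
    """
    # Category k has log-weight sum_{j<=k} (theta - delta_j); category 0 has 0.
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def item_difficulty(deltas):
    """Item location b, taken as the average of the item's thresholds."""
    return sum(deltas) / len(deltas)


# Hypothetical thresholds for a four-category item (none, mild, moderate, severe).
example_deltas = [-1.0, 0.0, 1.0]
probs = pcm_probabilities(0.0, example_deltas)
b = item_difficulty(example_deltas)
```

A useful property for reading Table 5 is that a person whose trait level equals a threshold is equally likely to respond in the two adjacent categories. When the first threshold sits at or above the item location, most respondents fall in the lowest categories, which matches the pattern of few responses beyond ‘Mild’.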
Lastly, a high degree of unidimensionality was established before this analysis: prior work has demonstrated that the SCAT has a level of unidimensionality which implies that general concussion symptom severity is a coherent construct and could be estimated precisely using a subset of items from the full instrument. The present analysis assumes unidimensionality based on this validated bifactor approach (Nelson et al., 2018; Wilmoth et al., 2020; Brett et al., 2020). This approach is grounded in data indicating that concussion symptoms, as evaluated 24-48 hours post-injury by the SCAT3, are essentially unidimensional. A single underlying dimension, or a total symptom severity score, accounts for 96% of the reliable variance in ratings.
Results
At the item level, statistically significant DIF (see Table 2) was detected solely during the initial three time intervals. Specifically, within six hours, the item difficulty differences (d) that impacted DIF indicated that male athletes experienced greater difficulty in managing somatic symptoms, including Headache (d= −0.27, p < 0.001), Pressure in the Head (d= −0.29, p = 0.04), and Sensitivity to Light (d= −0.36, p = 0.04). Two of these symptoms exhibited notable DIF. Within the 24-48 hour time frame, the same clinically relevant trends continued, with males experiencing greater difficulty in recognizing the severity of Pressure in the Head (d= 0.2, p = 0.04) and females appearing to experience greater difficulty identifying cognitive-based symptoms such as Don’t Feel Right (d= 0.29, p < 0.001). Additionally, in the emotional domain, More Emotional (d= −0.83, p <0.001) was significantly more difficult for males than females. While the other symptoms in the emotional domain did not show significant DIF, they suggested difficulty for both genders. Finally, during the asymptomatic period, only one symptom showed DIF, Feel in a Fog in the cognitive domain (d= 0.41, p < 0.001), indicating greater difficulty for males. DIF was not detected past the asymptomatic period.
During unrestricted return to play, the individual variability in symptom difficulty did not significantly impact the differential item functioning (DIF). However, the total symptom severity scores demonstrated that symptom difficulty varied significantly between genders (Table 3). Females reported higher symptom severity scores than males, with an average difficulty difference of 0.38 logits (d = −0.38, p < 0.001). The presence of this DIF signifies that certain symptoms might be measuring different traits among genders, which could potentially lead to biased outcomes and misleading symptom severity scores.
As seen in Table 4, the assessment’s reliability is at its lowest during the unrestricted return to play (r=0.47 vs r=0.31) but at its highest during the 24-48 hour time point for both females and males (r=0.92 vs r=0.92). To further investigate whether the DIF detected at the group level can be related to gender, trait levels were examined at unrestricted return to play (Table 5). The following items contained zero observations from males and were therefore dropped: Nausea/Vomiting, Balance Problem, Confusion, More Emotional, and Irritable. Table 5 displays the item location parameters and category thresholds for the PCM during the unrestricted return to play period. The analysis revealed that where the item location (b) for females is equal to or less than the difficulty of reporting a mild symptom (δ1), reporting symptoms beyond this severity level would be difficult.
Figure 1 visually displays the outcome of the PCM analysis through a Person-Item Map. These plots show each gender’s abilities and item difficulties, respectively, along the same latent dimension. Dark circles represent item difficulty and open circles represent category difficulty. For females, a total of 6 symptoms (Headache, Neck Pain, Difficulty Remembering, Low Energy, Drowsiness, and More Emotional) were more easily distinguishable between categories; however, for the remaining 16 symptoms it was too difficult for females to accurately report symptom severity. Additionally, for 12 of the 22 symptoms the item difficulty was b < 0, indicating difficulty categorizing past ‘Mild’. For male athletes, the item difficulties for Headache, Pressure in Head, Neck Pain, Sensitivity to Light, Feeling Slowed Down, Difficulty Remembering, and Low Energy were relatively easy to identify and contained zero observations in the Moderate or Severe category.
Discussion
Based on the findings of this IRT analysis, somatic and cognitive concussion symptoms appear to be more easily identifiable within 48 hours of the injury, regardless of gender, apart from symptoms of nausea and/or vomiting. Prior studies using the IRT model indicate that a condensed variant of the symptom checklist consisting of 3, 9, or 12 items is just as effective as the longer 22-item version in identifying concussion (Randolph et al., 2009; Wilmoth et al., 2020). These studies found similar cognitive and somatic symptoms, such as feeling slow, foggy, and not right, to be the most identifiable. This is consistent with the present work, which found that the symptoms most difficult for females to identify were somatic (Headache, Pressure in Head, Neck pain, Nausea/Vomiting, Dizziness, Balance problems, Sensitivity to light, Sensitivity to noise, Fatigue or low energy), whereas emotional symptoms (Feeling like “in a fog”, Difficulty Concentrating, Difficulty Remembering, Confusion) constituted the most difficult domain. Additionally, in line with these studies, Headache was the best-performing item for low-severity symptoms, which supports this item’s important role in concussion assessment. This analysis further confirms that the SCAT symptom checklist has a high degree of coherence, suggesting that general concussion symptom severity can be accurately estimated using a subset of items from the somatic and cognitive domains of the full instrument (Brett et al., 2020; Wilmoth et al., 2020).
Although only a limited number of items showed DIF, this analysis, also in line with these studies, revealed that emotion-related symptoms were not easily identified in either gender, particularly over 48 hours (Chin et al., 2016; Covassin & Elbin, 2011; Granito, 2002; Preiss-Farzanegan et al., 2009). Asking athletes to report emotional symptoms (More Emotional, Irritability, Sadness, Nervous or Anxious) in relation to their concussion may be too non-specific for both genders; thus, items such as “Are you Sad” or “Feeling Irritable” may be too vague or difficult to report accurately. The confounding influence of other variables within an athlete’s life (e.g., academic pressures, family, relationships, and so on) may be a reason why these symptoms are more difficult to identify; therefore, removing the emotional domain could enhance the accuracy of symptom reporting and improve recovery and rehabilitation outcomes.
Results indicated that there may be a disparity in how male and female athletes report their symptoms during the final phase before returning to play, which is arguably one of the most important time frames in recovery. This disparity could be a contributing factor to the recent studies indicating that female athletes do not report symptoms as frequently as their male counterparts during this phase. The analysis also uncovered challenges with accurate symptom reporting for both male and female athletes during this phase. Additionally, in line with previous findings (Nelson et al., 2018; Wilmoth et al., 2020), it was discovered that the SCAT3 symptom checklist may not be a reliable assessment measure when applied to female athletes returning to play, as the reliability of the assessment was found to be exceptionally low. The data also revealed that male athletes did not report five symptoms at all, suggesting that they either did not experience these symptoms or were underreporting them. Given these discrepancies and assessment issues, it is possible that gender biases are influencing the results and creating an inaccurate perception that female athletes are more symptomatic during the return to play phase.
In determining clinical symptom presentation post-concussion, it will be important to consider how psychological and social factors influence gender biases at different stages of recovery (Beran & Scafide, 2022; Caccese et al., 2023; CARE Consortium Investigators et al., 2018; Lempke, Oldham, et al., 2023; Sinnott et al., 2023). Researchers may want to expand on the current Rasch Partial Credit Model to account for additional parameters in order to better understand how much external factors influence accurate symptom reporting. Rasch modeling will be a useful tool for researchers in the concussion field to evaluate the relationship between sociological pressures, such as reporting intentions, and diagnostic measures, such as symptom presentation, and their effect on the variability of recovery length.
Furthermore, unlike basic regression modeling, this modeling approach can be applied to data from two different assessments (e.g., ImPACT vs SCAT) and can incorporate scale equating techniques (Kolen & Brennan, 2004; Simons et al., 2022; Choi et al., 2014). Rasch item response modeling approaches calibrate different scales and assessments onto the latent variable, thus accounting for measurement error and reducing potential bias (Kolen & Brennan, 2004). Previous studies attempted to validate this approach by calibrating the SCAT and Rivermead Post Concussion Symptoms Questionnaire (RPQ) using fixed-parameter calibration, which allows for a single item calibration for all items and offers a more rigorously established cross-walk between the assessments. Findings revealed that this modeling approach, which linked RPQ and SCAT total scores based on their relationship to the latent dimension (TBI-related symptom burden) that drives scores on both measures, offered more reliability and precision than other methods. Moreover, Rasch-based item response modeling is a robust, yet minimally used, modeling approach that can be beneficial in capturing this very important construct on the same scale as the items used to measure it. Until these assessments are analyzed to better reflect potential differential relationships of gender, it will be important to factor in certain symptomatic presentations when making return-to-play decisions, particularly for females.
Study Limitations
This research was conducted on a specific group of athletes participating in contact sports in the NCAA, such as football, lacrosse, field/ice hockey, soccer, and rugby. It is important to note that these results cannot be generalized to youth, high school, or professional sports. Therefore, it would be beneficial to conduct further studies on larger, national samples that include different age groups, contact and non-contact sports, as well as youth, collegiate, and professional sports, to better understand the psychometric properties of SCAT short forms as they pertain to female athletes.
Conclusion
As a result of this IRT analysis, based on N=1,258 NCAA athletes, it is worth considering that the assumption that females experience more concussion symptoms and are at a higher risk of prolonged symptoms may not be entirely accurate. The findings of this analysis suggest that current concussion assessments may contain gender biases resulting in differences in symptom presentation, further supporting the idea that concussion assessments built on male-based normative data cannot be generalized to the female athlete population. Moreover, these differences in symptom presentation may be causing female athletes to be considered “Not Recovered”, potentially hindering their return to the field of play. Therefore, an investigation into constructing tailored, female-specific symptom assessments may be an important first step in more accurately capturing the concussion recovery process in female athletes.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
Footnotes
Author Note: Karen M Schmidt https://www.linkedin.com/in/karen-schmidt-2607087/
The authors declare no financial conflicts of interest.
Federal Interagency of Traumatic Brain Injury Research (FITBIR) not Federal Interagency of Traumatic Brain Injury Repository