Abstract
Alterations in learning and decision-making systems are thought to contribute to core features of anorexia nervosa (AN), a psychiatric disorder characterized by persistent dietary restriction and weight loss. Instrumental learning theory identifies a dual-system of habit and goal-directed decision-making, linked to model-free and model-based reinforcement learning algorithms. Difficulty arbitrating between these systems, resulting in an over-reliance on one strategy over the other, has been implicated in compulsivity and extreme goal pursuit, both of which are observed in AN. Characterizing alterations in model-free and model-based systems, and their neural correlates, in AN may clarify mechanisms contributing to symptom heterogeneity (e.g., binge/purge symptoms). This study tested whether adolescents with restricting AN (AN-R; n = 36) and binge/purge AN (AN-BP; n = 20) differentially utilized model-based and model-free learning systems compared to a healthy control group (HC; n = 28) during a Markov two-step decision-making task under conditions of reward and punishment. Associations between model-free and model-based learning and resting-state functional connectivity between neural regions of interest, including orbitofrontal cortex (OFC), nucleus accumbens (NAcc), putamen, and sensory motor cortex (SMC) were examined. AN-R showed higher utilization of model-free learning compared to HC for reward, but attenuated model-free and model-based learning for punishment. In AN-R only, higher model-based learning was associated with stronger OFC-to-left NAcc functional connectivity, regions linked to goal-directed behavior. Greater utilization of model-free learning for reward in AN-R may differentiate this group, particularly during adolescence, and facilitate dietary restriction by prioritizing habitual control in rewarding contexts.
Introduction
Anorexia nervosa (AN) is a debilitating and potentially life-threatening disorder with one of the highest mortality rates among psychiatric conditions1–4. AN typically onsets in adolescence, with at least a third of affected individuals developing a chronic course of illness into adulthood5, yet mechanisms contributing to progression of illness are poorly understood. Recent conceptualizations posit a role of habit and goal-directed behavioral systems in the pathogenesis and maintenance of the disorder, although their relative contribution is debated6–9. Goal-directed control consists of flexible decision-making, in which the value of specific outcomes can change depending on shifting environmental contingencies. Conversely, habitual control consists of inflexible decision-making, in which choice behavior is difficult to change, even during instances of outcome devaluation10.
The central AN symptom of dietary restriction can be described as highly goal-directed, with persistent food restriction in service of long-term rewards (i.e., weight loss) indicating an effortful and goal-driven strategy. Alternatively, the rigidity with which individuals with AN uphold their caloric restriction and the cognitive inflexibility that is observed in their decision-making, might indicate greater use of habit-driven strategies. It has been proposed that what begins as a largely goal-directed pursuit of weight loss, driven by action-outcome learning (e.g., restriction becomes associated with weight loss outcomes), ultimately becomes habitual as repeated actions are overtrained and stimulus-response learning results in specific cues being automatically associated with disorder-relevant actions (e.g., palatable food cues become associated with restriction)7. Many individuals with AN also experience episodes of binge eating and purging, behaviors thought to reflect compulsivity (e.g., inflexible, repetitive behavior), and may too reflect greater reliance on habit. Thus, there is a question of how individuals with AN arbitrate between these systems and how this corresponds to restriction versus binge/purge symptomatology.
A compelling computational framework for this dual-system theory of goal-directed and habit decision-making exists in a class of reinforcement learning algorithms known as model-based and model-free learning11–13. These algorithms provide instantiations of neural computations that can be behaviorally modeled using variants of multistep Markov reward-based decision-making tasks that have been validated in both human14 and animal studies15.
Model-based learning algorithms consist of cognitive maps of state-action space that track environmental contingencies, such as transition probabilities when moving from one state to another, and consider these maps when selecting actions16. Consequently, this system is behaviorally akin to planning and, although flexible, imposes a considerable computational cost on the decision-maker17. The orbitofrontal cortex (OFC), which tracks value computations in response to feedback or shifting motivational and affective states18–21, and the ventromedial prefrontal cortex (vmPFC) are implicated in model-based learning22–25, with greater intrinsic resting functional connectivity between the medial OFC and nucleus accumbens (Nacc), which tracks prediction errors and anticipated reward and punishment outcomes26–29, correlated with greater model-based behavior24.
In contrast, model-free learning algorithms are linked to temporal difference learning frameworks, where value estimates based on outcomes are stored across trials and selected actions are dependent on the history of reward receipt, without consideration of environmental contingencies30. This system is similar to trial-and-error learning and is not easily adaptable, however it can be quite efficient, akin to habit31. Although neural signatures of model-based and model-free learning overlap in the medial prefrontal cortex and the ventral striatum, the putamen has also been implicated in habit behavior during sequential decision-making and outcome devaluation14,23,32–36, with putamen-supplementary motor are (SMA) functional connectivity, as well as increased neurite density in the left posterior putamen and right SMA, associated with model-free learning24.
Across psychiatric conditions, there is emerging evidence suggesting that disorders of compulsivity (e.g., obsessive-compulsive disorder, substance use) demonstrate deficits in model-based learning during sequential decision-making25,37–40. Critically, prior work examining model-based and model-free learning in a mixture of community and clinical samples endorsing symptoms of eating pathology (e.g., compulsive eating), also demonstrate attenuation in model-based learning25,37–39,41. Only one such study was conducted in a sample of individuals with full-threshold AN and found evidence of deficits in model-based learning for food and money that were not remediated at weight restoration, reflecting a potential trait behavioral marker of the disorder.
The current study aimed to characterize reliance on model-free and model-based reinforcement learning in AN by clarifying several key issues. First, because AN typically onsets during adolescence and is associated with high risk of chronicity, it is critical to uncover developmental mechanisms that contribute to maintenance of the disorder. Moreover, this developmental period consists of a time frame when model-based learning begins to come online42, highlighting a possible treatment target if the relative balance between model-free and model-based systems go awry during initial stages of psychiatric illness. Second, there is heterogeneity in the AN phenotype, with some patients exhibiting only dietary restriction (AN-R) and others also exhibiting binge/purge behaviors (AN-BP). Divergence of symptoms may suggest dissociable deficits in decision-making in individuals with AN-R versus AN-BP. Finally, while most studies have focused exclusively on reward learning, recent research suggests avoidance motivation promotes goal-directed learning43, and in OCD, learning bias shifts towards goal-directed control under conditions of loss44. Although reward deficits are well-established in AN45–50, altered punishment motivation and learning might be particularly salient51–55, as AN is characterized by high levels of harm avoidance, exaggerated neural response to losses, and worse treatment outcomes associated with poorer loss-related learning51,56,57. However, this may differ by subtype as eating disorders characterized by binge/purge symptoms demonstrate increased limbic-striatal reward response45–47,50,58,59, potentially indicating that both subtype and outcome valence may affect utilization of model-based and model-free learning systems.
To elucidate these questions, we conducted a monetary outcomes version of the two-step sequential decision-making task under conditions of reward and punishment in a sample of female adolescents with AN and demographically-matched healthy controls (HC). Participants also completed a separate resting-state functional magnetic resonance imaging scan (rsfMRI) to ascertain whether corticostriatal goal-directed and habit circuits were differentially associated with model-based and model-free learning in AN compared to HC. We hypothesized that adolescents with AN would have attenuated model-based learning compared to HC (especially AN-BP), consistent with increased compulsivity, and greater reliance on model-free learning (especially AN-R), consistent with the notion that dietary restriction is habit-based. With regards to neural functioning, we hypothesized that greater utilization of model-free learning would be linked to stronger functional connectivity between the putamen and SMA, while greater utilization of model-based learning would be linked to stronger functional connectivity between the OFC and NAcc. Finally, we explored whether model-based and model-free learning for reward and punishment were associated with clinical variables.
METHOD
Participants
Participants were all female and adolescent, with 36 meeting DSM-5 criteria for restricting type AN (AN-R; age range = 13.67-18.08), 20 meeting DSM-5 criteria for binge-purge type AN (AN-BP; age range = 13.08-18.17), and 28 being demographically matched healthy controls (HC; age range = 13.42-17.83). Approximately 5% of this sample (n = 4) completed the reinforcement learning task but did not complete the rsfMRI scan. One subject was excluded due to excessive motion during the rsfMRI scan. Participants included in resting-state analyses included 35 AN-R, 17 AN-BP, and 27 HC. Individuals with AN were recruited from the University of California, San Diego Eating Disorders Treatment and Research Program60, and healthy controls were recruited from the community. The study was approved by the University of California San Diego Institutional Review Board, and all participants were provided written informed consent (see Supplement for details regarding participants and assessments).
Procedure
Two-Step Sequential Decision-Making Task
Participants completed a modified two-step sequential decision-making task14, which was previously validated in an adolescent sample42. They performed the task under conditions of reward, where they either won $1 or won nothing, and punishment, where they either lost $1 or lost nothing (Figure 1A). The order of the conditions was counterbalanced within each group, with each condition consisting of four blocks and a total of 200 trials (see Supplement for task details).
Functional Magnetic Resonance Imaging (fMRI)
Resting-state fMRI was acquired separately on a 3.0 T GE MR750 scanner equipped with quantum gradients providing echo planar capability, using a Nova Medical 32 channel head coil (maximum gradient strength: 50 mT/m, slew rate: 200 T/m/s). An eyes-open protocol was implemented, with participants instructed to look at fixation crosshair for the duration. The 1 hour session included: a three-plane localizer; sagittally acquired (0.8 mm slice thickness, FOV=256×240 mm) T1-weighted (MPRAGE PROMO, TE=2.3 ms, flip angle=8o, matrix=320, 2x in-plane acceleration, scan time=7:50) and T2-weighted (3D CUBE, TE=60 ms, variable flip angle, matrix=320, 2x in-plane acceleration, scan time=6:51) whole brain scans for alignment and morphometry; and T2* weighted echo-planar imaging (EPI; 749 volumes, TR= 800 ms, TE=37 ms, flip angle=52°, FOV=180 (RL) x 208 (AP) mm, 72 2 mm thick axial slices, multiband factor=8).
Statistical Analysis
Analysis of Raw Choice Data
Consistent with prior work14,61, mixed-effects logistic regression analysis was performed on the raw choice data using lme4 package (Version 1.1-8)62 for the R software environment (Version 4.0.1)63. We estimated separate models for reward and punishment conditions. This model predicted first-stage choices (i.e., stay versus switch with respect to the previous response) as a function of previous outcome (i.e., won/lost in previous trial), previous transition type (i.e., rare/common), group, and their interaction. All two- and three-way interactions were defined as fixed effects, with a random intercept and subject-level adjustments to outcome, transition, and their interaction set as random slopes. The lme4 syntax is as follows: From this analysis, we estimated a model-free effect, approximated by a main effect of previous outcome on stay probability, as well as a model-based effect, approximated by a significant outcome-by-transition interaction.
Percentage of optimal choices was calculated for the second stage decision, specifically the choice between two second-stage stimuli. This was done separately for each condition and for every subject. Means for each group, and for each condition, were subjected to one sample t-tests, comparing to chance (i.e., 50%). Reaction time data for each stage and condition were compared between groups. Results for these analyses are presented in the Supplement (Figures S1-S2).
Computational Model
We fit participants’ choice data to a popular reinforcement learning model previously used to model two-stage choice data14,42,61. In this “hybrid” model, participants’ choices are theorized to reflect a weighted combination of model-free and model-based learning. This model has been well-validated in a range of psychiatric disorders, with prior work establishing its reliability in a range of psychiatric disorders37. The degree to which participants’ responses are governed by principles of model-free or model-based learning is controlled, respectively, by the βMF and βMB parameters. Both parameters are bounded at zero and larger values reflect increased model-free or model-basedness.
These model-based and model-free weight parameters were estimated with a hierarchical Bayesian model-fitting technique. To better understand how diagnostic group membership affected model parameters of interest, these parameters were determined on the basis of group membership (akin to regression) estimating coefficients (γ) that encoded how these parameters differed between each group (HC, AN-R, AN-BP): where is_R and is_BP are dummy-coded variables reflecting whether participant j was (or was not) in the AN-R or AN-BP subgroup respectively. Thus, γ0 encoded model weights in HCs and the other y coefficients tested for differences between HCs and AN subgroups.
To assess the statistical reliability of the observed parameter estimates, a 95% credibility interval (CI) was computed for each parameter. If the CI was wholly situated above or below zero, we determined with 95% confidence that the corresponding parameter was principally positive or negative, respectively. Moreover, we computed Bayesian p-values (P), which represent one minus the proportion of the posterior that falls above or below zero (depending on the sign of the median posterior value: below zero if b < 0 and above if b > 0). In line with the traditional interpretation of frequentist p-values, Bayesian p-values can be interpreted probabilistically as “there is a (P×100) percent chance that the effect is zero or a reversal of the central tendency.” Detailed information about the model can be found in the Supplement.
Exploratory analyses with clinical variables
We extracted subject-level random effects from the hierarchical estimation procedure described above to examine the relationship between our computational parameters of interest (βMB and βMF) and clinical variables. These random effects estimates were non-linearly transformed before additional analyses, as they were not normally distributed as indicated by Shapiro-Wilk tests (ps<.001). To transform the data, we converted the estimates to percentile scores and applied a probit link function, converting to z-scores, which improved normality of the distributions (ps>.05). Transformed estimates were used as dependent variables in a series of mixed-effects regression analyses with a group-by-condition-by-measure 3-way interaction term. All analyses controlled for z-scored BMI, IQ, and age as additional covariates. Analyses were conducted and estimated using the R lme4 package and family-wise Bonferroni-corrected to account for the exploratory nature of the analyses. Mixed-effects linear regression analyses were conducted with the following measures: Eating Disorder Examination (EDE), State-Trait Anxiety Inventory (STAI), Beck Depression Inventory – II (BDI-II), Temperamental Character Inventory – Harm Avoidance (TCI), Sensitivity to Punishment and Reward Questionnaire (SPSRQ), Behavioral Inhibition/Behavioral Activation Scales (BIS/BAS), and Adult Temperament Questionnaire – Effortful Control (ATQ; see Supplement for descriptions). The analysis involving the EDE was only conducted in AN-R and AN-BP, as the majority of HC had total scores of zero.
Resting-State Functional Magnetic Resonance Imaging Analysis
Images were processed using scripts developed by the Human Connectome Project freely available online (https://github.com/Washington-University/HCPpipelines). We adapted the HCP quality control metrics64, and HCP preprocessing requirements65, prior to using the HCP Workbench. Functional images were processed to remove spatial distortion, motion corrected, aligned to the subject’s structural volume, bias field corrected, transformed to standard space, and smoothed using FSL and other tools developed for the HCP protocol. ICA + FIX66 was applied to remove spatially specific structured noise, such as motion and transient head movement due to swallowing67. Subjects were excluded if any one parameter’s average exceeded 2 standard deviations from the mean to ensure that group differences were not due to motion. One subject was removed from the dataset following preprocessing due to excessive motion.
To investigate whether resting-state functional connectivity is associated with model-based and model-free learning, we conducted an ROI-to-ROI analysis on an a priori network. Specifically, we selected regions that have been implicated in both habit and goal-directed learning, including the OFC, NAcc, bilateral putamen, and SMA. Connectivity analyses were conducted in nilearn, a python-based brain imaging toolbox (https://nilearn.github.io/index.html). Anatomical ROIs were identified using the Harvard-Oxford Cortical and Subcortical Atlases68–71. Timeseries from each ROI were extracted and subjected to ROI-to-ROI analysis. Connectivity values (Fisher z-transformed correlation coefficients) between OFC-NAcc and putamen-SMA dyads were extracted from the connectivity map and entered into mixed-effects linear regression models. All analyses utilized transformed subject-level parameters of interest from the computational model. To examine links between model-based behavior and associated neural regions, model-based estimates (βMB) were set as the dependent variable with OFC-NAcc connectivity, group, condition, a connectivity-by-group-by-condition 3-way interaction, z-scored age, BMI, and IQ entered as independent variables. We specified the same model for model-free estimates (βMF) and putamen-SMA connectivity to examine associations between behavioral and neural regions implicated in habit learning. Because these analyses are linked to a priori hypotheses about the relationships and direction of relationships, we did not correct for multiple comparisons. Additionally, ROI-to-ROI connectivity was represented using plotting utilities from nilearn (Figure 6C).
RESULTS
Sample Characteristics
Groups did not differ on age, education, race, ethnicity, IQ, or length of illness (for AN-R and AN-BP only); however, they did differ on BMI, lowest lifetime BMI (for AN-R and AN-BP only), anxiety (STAI-T and STAI-S), depression (BDI), and eating disorder severity (EDE; Table 1).
Raw Choice Data
Model-based and model-free learning were approximated using raw choice data by examining probabilities of first-stage stay behavior. Specifically, increased probabilities of stay versus switch following second-stage wins served as a measure of model-free behavior. Increased probabilities of stay versus switch following wins during common transitions, and losses during rare transitions, served as a measure of model-based behavior. To visualize differences in these systems41, we computed difference scores based on prior work by Eppinger et al. (2017). The model-based difference score was taken as: (common/win + rare/lose) – (common/lose + rare/win). The model-free difference score was taken as: (common/win+ rare/win) – (common/lose + rare/lose). These values are plotted as a function of group and condition (Figure 3).
Results from the mixed-effects logistic regression analysis indicate that for reward, there was a significant main effect of outcome (B = 1.27, SE = 0.27, p = <.001) and an outcome-by-transition interaction (B = -1.16, SE = 0.32, p = <.001), but no significant group differences (Table 2). Punishment was similar, with a main effect of outcome (B = 1.15, SE = 0.23, p = <.001) and an outcome-by-transition interaction (B = -0.69, SE = 0.24, p = .003), but again no significant group differences (Table 2). An omnibus mixed-effects logistic regression analysis also including condition revealed similar results, with no group differences (see Supplement). Results therefore support mixed usage of model-free and model-based learning across groups, with computational model results below offering a more granular understanding of group differences in the relative balance of these systems.
Computational Model
The reinforcement learning model consisted of a combination of model-based and model-free weights (βMB and βMF, respectively) that control the relative contribution of model-free and model-based updating on choice behavior. Differences between HC and AN subgroups were encoded in regression coefficients (γ0, γANR, and γANBP) for each freely estimated parameter, with HC being set as the reference group (γ0): if these coefficients differed from zero, then diagnostic subgroups differed on these parameters from HC. To compare AN subgroups to each other, we reconstructed subgroup posteriors and directly compared these posterior distributions to each other: AN-R = γ0 + γANR and AN-BP = γ0 + γANBP. Posterior distributions of transformed γ weights are plotted in Figures 4-5. Below, we report average y values, 95% credibility intervals, and Bayesian P-values (see Method).
In the reward condition, AN-R had significantly higher model-free learning compared to HC (yANR = 0.92, P = .03, 95%CI[-0.02, 1.82]) and lower model-based learning at a trend level (yANR = -0.31, P = .05, 95%CI[-0.70, 0.07]). In the punishment condition, AN-R displayed significantly attenuated model-free learning (yANR = -1.26, P = .002, 95%CI[-2.38, -0.38]), and model-based learning compared to HC (yANR = -0.58, p = .004, 95%CI[-1.01, -0.16]). AN-BP also had attenuated model-free learning for punishment compared to HC at a trend level (yANBP = -0.72, p = .07, 95%CI[-1.99, 0.21]). Overall, AN-R showed greater reliance on model-free learning for the reward condition, but deficits in both model-free and model-based learning for the punishment condition. Conversely, AN-BP only displayed trend-level deficits in model-free learning for punishment relative to HC. These results are displayed in Tables 3 and 4. Estimated y weights for the remaining freely estimated model parameters (α, β2, p) are included in Table S2-S4.
Associations with Resting-State Functional Connectivity
We extracted rsfMRI functional connectivity coefficients from subject-level connectivity maps for OFC-NAcc and SMA-putamen pairs. There were no group differences in connectivity between these ROIs, controlling for z-scored age, BMI, and IQ. There were no significant effects between SMA-putamen connectivity and βMF, including no main effect of connectivity, no connectivity-by-group interaction, and no connectivity-by-condition interaction. For OFC-left NAcc connectivity and βMB, there was a significant main effect of connectivity for AN-R (B = 2.15, SE = 0.92, p = .02), but no significant main effects for any other group. There were also no significant connectivity-by-group interactions, nor were there connectivity-by-condition interactions. Because reward was the reference level for the condition variable, a main effect of connectivity indicated that higher reward βMB was associated with stronger resting-state connectivity between the OFC-left NAcc in AN-R only (Figure 6A-6B). To explore this relationship within the AN-R group, we conducted a post-hoc linear regression analysis predicting reward βMB and specifying an OFC-left NAcc connectivity and EDE Global Score interaction, controlling for the same covariates. The main effect of connectivity (B = 7.53, SE = 2.07, p = .001) and the interaction term (B = -1.92, SE = 0.70, p = .01) were significant, showing that the more positive the relationship between reward βMB and connectivity, the lower the EDE Global Score (Figure S7).
Associations with Clinical Variables
Consistent with prior work41, there were no significant associations between βMB or βMF and symptom measures (BMI, EDE, BDI, or STAI) after family-wise correction. We also did not find significant associations with TCI Harm Avoidance, SPSRQ Reward, SPSRQ Punishment, BIS, or BAS Reward Responsivity. However, when examining βMB and ATQ Effortful Control, there was a significant Effortful Control-by-condition-by-group interaction, suggesting differences in the directionality of both the effect between AN-BP and AN-R and the effect between conditions for AN-BP, detailed in the Supplement (Table S8, Figure S8).
Discussion
This study investigated, amongst subtypes of adolescent AN and HC, alterations in model-based and model-free learning under conditions of reward and punishment, as well as associated links to corticostriatal functional connectivity. Overall, AN-R showed greater reliance on model-free learning for reward, but decreased utilization of both model-free and model-based learning for punishment. Moreover, in AN-R only, higher model-based learning for reward was associated with stronger functional connectivity between regions implicated in goal-directedness, and this was associated with lower eating disorder severity. In contrast, adolescents with AN-BP displayed trend-level deficits in model-free learning for punishment, as well as a negative relationship between effortful control and model-based learning for reward, but no other significant behavioral or neural differences when compared to HC or AN-R.
Together, results suggest substantive differences in the balance between model-free and model-based learning between AN-R and HC across conditions, as well as differentiation between AN subtypes primarily under reward. Such alterations in adolescent AN-R may reflect over-reliance on inflexible, habit-based reward learning and inadequate utilization of flexible, goal-directed learning.
Parameter estimates from the reinforcement learning model indicate that AN-R displayed patterns of model-free and model-based learning similar to those observed in OCD40. There is accumulating evidence that disorders characterized by compulsivity (e.g., OCD, substance use) rely less on model-based learning during sequential reward-based decision-making tasks25,37–39,41,44, indicating that this learning bias may be a transdiagnostic feature of psychiatric conditions characterized by cognitive or behavioral perseveration. Of note, greater utilization of model-free learning in our AN-R sample may be aligned with etiological models purporting that dietary restriction is initially maintained by operant conditioning (e.g., reinforced by reward of weight loss and praise) that becomes a classically conditioned habit over time8,73,74 and raises the question of whether a tendency towards reward-based model-free learning may reflect a premorbid susceptibility factor that predisposes the early development of dietary habits. This hypothesis is aided by the positive relationship between model-based learning and OFC-to-left NAcc functional connectivity in AN-R only, potentially indicating that heightened neural synchrony in reward circuitry is connected to greater reward-based goal pursuit. Prior studies have reported disturbances in reward-related functional architecture in AN75,76, particularly in ventral and dorsal frontostriatal networks, that support etiological theories implicating alterations in reward functioning and over-reliance on habit7,73,77,78. Our post-hoc analysis further clarified that a stronger, positive association between OFC-to-left NAcc functional connectivity and model-based learning was linked to less severe eating pathology, providing potential evidence of a brain-based protective marker, though this was exploratory. If true, this would suggest that learning systems contributing to decision-making under reward and corresponding neural circuitry may be of particular clinical importance79.
While behavioral patterns present for AN-R are relatively consistent with prior work in psychiatric samples25,37–39,41,44, the lack of difference between AN-BP and HC was unexpected. A prior study showed that AN displayed attenuated model-based learning compared to HC, with this effect being largely driven by AN-BP41. Developmental factors and/or illness stage may have contributed to our findings; indeed, prior research demonstrates that utilization of model-based learning can change over the course of development, with model-based systems coming online during adolescence and strengthening into adulthood42,80–83. Further, higher compulsivity at baseline in healthy adolescents is associated with diminished strengthening of model-based control at a later time point, suggesting that maturation of this system can be hijacked by compulsive psychological features82. It is therefore possible that adolescent AN-R reflects delayed maturation of these systems and AN-BP may not show signs of deficiency at this stage, but may fail to improve as a function of age. Longitudinal studies are needed to resolve this issue.
Our results also suggest that AN-R and AN-BP are behaviorally dissociable, particularly in their responses to reward learning. AN-R utilized greater model-free learning for reward, while AN-BP did not significantly differ from HC in either learning system. Exploratory analyses with clinical variables supported the distinction between subtypes as well, with greater reward-related model-based learning in AN-BP linked to lower levels of adaptive effortful control (i.e., self-regulation), while AN-R displayed a weaker, positive relationship. One possible explanation for this distinction may be differences in saliency of reward. It is postulated that AN-BP is characterized by greater sensitivity to reward compared to AN-R, with some studies corroborating this hypothesis53,84,85, and others contradicting it52,54. In the current study, self-reported sensitivity to reward is intact in the AN sample and is not differentiated by subtype, but utilization of model-free/model-based learning to make decisions in a rewarding context is.
Model-based control has high computational expense, making it an effortful choice. If individuals with AN-BP are indeed more sensitive to reward, it is possible that positive outcomes (e.g., monetary gain, palatable food consumption) have higher motivational and incentive salience, leading to use of a cognitively demanding strategy even in the context of lower trait self-regulatory control. This is consistent with findings that adults with bulimic-type symptoms are more willing to work for reward as demonstrated by a higher breakpoint during a progressive ratio task86, and prior research demonstrating that model-based learning can be modulated by stress and cognitive control61,87.
It is worth noting that the results we observed in the punishment condition were not consistent with our hypotheses. Despite greater self-reported sensitivity to punishment (i.e., subjective responsivity and/or avoidance of aversive consequences), AN-R displayed reductions in both model-based and model-free learning compared to HC, raising a question of how this group ultimately made choices if they were relatively deficient in both systems during punishment. Notably, this is in contrast to prior work suggesting increased model-based learning for punishment, which has been observed in OCD40, serotonin-depletion88, and in healthy adolescents and adults80. We previously reported lower learning rates for both reward and punishment in a predominantly young adult sample of women with AN-R; however, worse learning on punishment trials was associated with poorer outcome, suggesting that these individuals may have particular difficulty modifying expectations when punishment is possible.
Whether this relates to active avoidance of aversive outcomes or cognitive inflexibility is difficult to determine, although in the current study, the punishment inverse temperature parameter (β2) was significantly lower in AN-R compared to HC, which indicates more exploration of choices and less exploitation of choice valuation during the second stage (Table S3). Alternatively, AN-R may have utilized another learning system or strategy (e.g., successor representation) for the punishment condition, although more research is needed to explore this hypothesis.
This study was the first to investigate model-based/model-free learning in adolescent samples of AN-R, AN-BP, and HC. Strengths of this work include a well-characterized adolescent AN sample, use of a well-validated two-step decision-making task with monetary outcomes to avoid symptom provocation, modulation of outcome valence, use of a well-validated computational model to elicit latent learning features, and the inclusion of rsfMRI to examine relationships between cognitive and neural features of learning. Results add to growing literature on the arbitration between these learning systems in psychopathology and provide greater insight into not only earlier stages of development and disease progression, but also valence-specific effects that may be of particular importance in development and maintenance of AN.
Limitations
Despite these strengths, limitations of this study highlight directions for future work. Our sample included fewer participants with AN-BP compared to AN-R and HC, and both weight-restored and underweight adolescents with AN were included. Additionally, neural data only consisted of rsfMRI scans, with a limited ability to examine task-specific neural functioning that would correspond to these constructs. Specifically, we found no relationship between learning and connectivity in AN-BP and HC, possibly indicating that these learning systems might not be encoded in the connectivity signatures investigated here but may be more distributed, particularly as our sample is adolescent. Future work should replicate AN-BP findings, examine impact of illness state, and utilize task-based neuroimaging to reveal corresponding neural functioning. Lastly, we recognize that the dual-system framework may have limitations. Collins & Cockburn (2020)89 have argued that the forced dichotomy of the model-based/model-free framework artificially limit the dimensionality of learning processes. Moreover, model-based and model-free systems may not be as dissociable as this framework suggests. Consequently, future work, as suggested by Collins & Cockburn (2020), should explore different dimensions of learning systems, as well as the constructs that may influence components of those systems (e.g., uncertainty).
Conclusions and clinical implications
This is the first study to evaluate the contribution of model-based and model-free learning for reward and punishment in adolescent AN-R and AN-BP. Using a well-validated two-step decision-making task and rsfMRI, we observed important distinctions primarily between AN-R and HC, suggesting increased reliance on model-free learning for reward (e.g., habit) may contribute to the ability to consistently restrict food in AN-R. Exploratory results also suggest that a stronger relationship between model-based learning for reward and neural connectivity in reward circuitry may serve as a protective factor for AN-R. Finally, there is preliminary evidence of distinctions between AN subtypes that may result in divergence of mechanisms contributing to symptom expression. Consequently, interventions that reduce model-free learning and/or improve model-based learning may facilitate recovery in adolescence, though this may be subtype-dependent.
Data Availability
Data will be available upon reasonable request once the manuscript has been published.
Acknowledgements and Disclosures
This work was supported by National Institutes of Health Grant No. R21MH118409. C.B. is funded by NIH/NIMH Grant No. F31MH133362. S.D. is funded from the Natural Sciences and Engineering Research Council of Canada.
The authors report no biomedical financial interests or potential conflicts of interest.
References
- 1.↵
- 2.
- 3.
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.
- 20.
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.
- 28.
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.
- 34.
- 35.
- 36.↵
- 37.↵
- 38.
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.
- 47.↵
- 48.
- 49.
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.
- 70.
- 71.↵
- 72.
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵