ANN-like Synapses in the Brain Mediate Online Reinforcement Learning

August 04, 2025 By: Shun Li

Synaptic plasticity rules in the brain are normally thought of as changing synaptic weights but not signs, unlike artificial neural networks. We show that a type of synapse in the brain challenges this long-held assumption. These synapses switch between more excitatory and more inhibitory in an experience-dependent manner, and contribute to online dopamine updates during reinforcement learning.

Learning is crucial for animal survival in a changing and dynamic world. Similarly, artificial neural networks learn to process information, perform computations, and recognize patterns. Much of our early learning resembles the supervised process used to train ANNs. For example, we acquire knowledge in school from teachers, master skiing through repeated falls on the slopes, and choose where to eat late at night based on past dining experiences.

However, the brain’s implementation of learning uses an architecture that is fundamentally different from that of artificial neural networks. Additionally, the brain learns with far lower power consumption, seamlessly integrates training and inference, and excels at continuous learning and generalization. Therefore, deciphering the brain’s specialized learning architecture could inform the design of more efficient and adaptive learning machines.

A major difference in brain and ANN learning

A fundamental difference between learning in natural and artificial systems is that the sign of synapses between pairs of neurons in the brain is generally fixed. In most mammalian circuits, the excitatory neurotransmitter glutamate and the inhibitory neurotransmitter GABA are released by different neurons, so each neuron, and each of its synapses, produces either excitatory or inhibitory postsynaptic currents.

Because a neuron typically uses the same neurotransmitter across all its terminals, conventional plasticity mechanisms cannot change a synapse’s sign. This feature of biological neural networks is in strong contrast with the structure of artificial neural networks (ANNs) in which each connection (the equivalent of a synapse) between two nodes (neuron equivalents) can be either positive or negative and can change during training. Sign-switching plasticity increases ANN performance by expanding the solution space of a network.
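
To make the "expanded solution space" point concrete, here is a minimal Python sketch (not taken from the preprint). It fits a single weight to a toy target that can only be matched with a negative connection. The function name `fit`, the data, and the learning rate are illustrative assumptions, and the fixed-sign case is modeled simply by clipping the weight at zero after each gradient step.

```python
# Toy comparison: a weight that may flip sign vs. one whose sign is fixed.
# Everything here (names, data, hyperparameters) is hypothetical.

def fit(xs, ys, sign_fixed, lr=0.05, steps=200):
    """Gradient descent on mean squared error for the model y = w * x."""
    w = 0.5  # start as a positive ("excitatory") weight
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        if sign_fixed:
            w = max(w, 0.0)  # the weight may shrink to zero but never flip sign
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return w, loss

# The target relationship needs a negative weight (y = -2 * x).
xs = [1.0, 2.0, 3.0]
ys = [-2.0, -4.0, -6.0]

w_free, loss_free = fit(xs, ys, sign_fixed=False)
w_fixed, loss_fixed = fit(xs, ys, sign_fixed=True)
print("sign can switch:", w_free, loss_free)
print("sign fixed:     ", w_fixed, loss_fixed)
```

In this toy case the unconstrained weight settles near -2 with almost no error, while the sign-fixed weight gets stuck at zero: the best solution lies outside the space it is allowed to explore.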

A synapse that implements sign-switching plasticity

In a recent preprint from the Sabatini lab, we show that certain synapses in the brain learn in a manner remarkably similar to ANNs by implementing sign-switching plasticity. Specifically, we examined synapses formed by somatostatin (Sst)-expressing neurons in the entopeduncular nucleus (EP) onto neurons in the lateral habenula (LHb). Remarkably, these EP Sst neurons co-package glutamate and GABA into the same synaptic vesicles. As a result, each action potential in an EP Sst neuron causes simultaneous release of both glutamate and GABA, apparently at the same time telling the postsynaptic LHb neuron to fire and not fire.

But what function might this co-releasing synapse serve? In ANNs, the strength of each synapse is typically set using gradient descent to minimize a cost function, which is roughly equivalent to minimizing the error in the calculation carried out by the circuit. Gradient descent in multi-layer networks, such as the brain, requires that each synapse be updated based on a rule that requires knowing how the activity of both the pre- and post-synaptic neuron relates to the error, which is difficult to implement in biological circuits.

Intriguingly, the upstream and downstream circuitry of EP Sst→LHb synapses closely resembles a classic multi-layer perceptron (MLP):

Input layer (cortex): In our brain, the cortex encodes a vast amount of information relating to movement, the sensory environment, internal goals, and value signals. For instance, when we dine at a restaurant, some cortical populations encode our hunger state, others represent the appearance of the food, and others reflect our past preferences for similar dishes.
Hidden layer 1 (EP): A subset of cortical outputs (≈10 million neurons) projects to the striatum and then converges onto the EP (≈1,000 neurons). If the cortex transmits raw data, EP neurons relay a compressed summary of that information. For example, during dining, a single EP neuron might encode both our hunger level and the visual appeal of the meal.
Hidden layer 2 (LHb): LHb neurons receive input from EP through these co-releasing synapses, forming a “weight matrix” that closely resembles that of an MLP. The amplitudes of glutamatergic and GABAergic currents evoked by EP Sst axon release vary widely, indicating a broad dynamic range of signal transmission.
Output layer (VTA): LHb neurons di-synaptically inhibit dopamine (DA) neurons in the ventral tegmental area (VTA), yielding a fixed negative weight between LHb and the VTA layer. Experience-dependent LHb activity is believed to encode negative value or negative value prediction, thus contributing to the reward prediction error (RPE) computed by VTA DA neurons.

Following the flow of information from cortex to EP to LHb to VTA, we can see that EP Sst→LHb synapses lie at the center of this transformation: they take in signal mixtures that combine contextual information about external and internal states, and their output updates downstream DA release and contributes to the calculation of reward prediction error (RPE).

**Figure 1: Overview of the circuit and hypothesis.** The EP>LHb>VTA circuit can be viewed as a multi-layer perceptron. The EP layer takes in a transformed mixture of contextual signals from cortex/striatum and sends it to the LHb layer, which outputs signals that update downstream dopamine release. Thus, EP>LHb synapses can be viewed as the weight matrix between the EP and LHb layers, and the sign of each weight can be dynamically adjusted by changing the ratio of excitatory to inhibitory transmission.
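
The EP-to-LHb-to-DA portion of this picture can be sketched as a tiny forward pass in Python. This is only an illustration under stated assumptions: the function names, the linear units, the averaging over LHb neurons, and all numbers are hypothetical, and each co-releasing synapse is reduced to a net weight equal to its glutamatergic strength minus its GABAergic strength.

```python
# Toy forward pass: EP axons -> LHb neurons -> dopamine (DA) output.
# All names and values are hypothetical; nothing is fitted to data.

def net_weights(glu, gaba):
    """Net signed weight per synapse: excitation (glutamate) minus inhibition (GABA)."""
    return [[g - i for g, i in zip(g_row, i_row)]
            for g_row, i_row in zip(glu, gaba)]

def forward(ep_activity, glu, gaba, lhb_to_da=-1.0):
    """Return (drive onto each LHb neuron, resulting change in DA)."""
    w = net_weights(glu, gaba)
    lhb = [sum(w_ij * x for w_ij, x in zip(row, ep_activity)) for row in w]
    da = lhb_to_da * sum(lhb) / len(lhb)  # fixed negative LHb -> DA weight
    return lhb, da

ep = [1.0, 1.0, 0.0]                      # two of three EP axons are active
glu = [[1.0, 0.5, 0.2], [0.3, 0.3, 0.3]]  # rows: LHb neurons, columns: EP axons

# Balanced synapses: glutamate and GABA cancel, so DA does not move.
_, da_balanced = forward(ep, glu, gaba=glu)

# GABA-dominant synapses: LHb is inhibited and DA is dis-inhibited.
gaba_strong = [[2.0, 1.0, 0.2], [0.6, 0.6, 0.3]]
_, da_inhibitory = forward(ep, glu, gaba_strong)

print(da_balanced, da_inhibitory)
```

Changing only the glutamate-to-GABA ratio at each synapse is enough to flip the sign of the entry in the weight matrix, and with it the direction of the DA change.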

How can EP Sst→LHb synapses achieve this transformation from context to value update signal? We propose that sign-switching plasticity at EP Sst→LHb synapses may enable rapid online updating of RPE calculations, aligning with the LHb’s role in adaptive learning. The updating rule closely resembles learning in a perceptron and can be summarized as follows:

If the activity of EP co-releasing axons is associated with good outcomes, the EP Sst→LHb synapse should become net inhibitory (which inhibits LHb and dis-inhibits VTA DA neurons).
If the activity of EP co-releasing axons is associated with bad outcomes, the EP Sst→LHb synapse should become net excitatory (which excites LHb and inhibits VTA DA neurons).

In such an architecture and with this learning rule, if the sign between the hidden layer and the output is fixed (negative in this case), then the local activity of each EP Sst→LHb neuron pair can be directly related to the overall error, making a global error signal sufficient to implement gradient descent.
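
The two rules above can be sketched as a perceptron-style update in Python. This is a hedged illustration, not the model used in the preprint: it assumes a single linear LHb unit, a fixed LHb-to-DA weight of -1, a squared-error loss on a scalar RPE-like error, and hypothetical names (`da_response`, `update`, `train`) and parameters throughout.

```python
# Toy learning rule for net EP -> LHb weights (positive = net excitatory).
# Model, names, and parameters are assumptions made for illustration.

def da_response(w, x, lhb_to_da=-1.0):
    """DA change evoked by EP activity x through one linear LHb unit."""
    return lhb_to_da * sum(wi * xi for wi, xi in zip(w, x))

def update(w, x, outcome, lr=0.1, lhb_to_da=-1.0):
    """One gradient-descent step on 0.5 * error**2 using a single global error."""
    error = outcome - da_response(w, x, lhb_to_da)  # scalar, RPE-like signal
    # d(loss)/d(w_i) = -error * lhb_to_da * x_i. Because lhb_to_da is fixed,
    # each synapse needs only the global error and its own presynaptic activity.
    return [wi + lr * error * lhb_to_da * xi for wi, xi in zip(w, x)]

def train(w, x, outcome, n_trials):
    """Repeat the same cue-outcome pairing for n_trials."""
    for _ in range(n_trials):
        w = update(w, x, outcome)
    return w

x = [1.0, 0.5, 0.0]   # activity of three EP axons during the cue
w0 = [0.0, 0.0, 0.0]  # start balanced (net neutral)

w_reward = train(w0, x, outcome=+1.0, n_trials=50)        # cue predicts reward
w_punish = train(w_reward, x, outcome=-1.0, n_trials=50)  # contingency reversed

print("after reward pairing:    ", w_reward, da_response(w_reward, x))
print("after punishment pairing:", w_punish, da_response(w_punish, x))
```

In this sketch, the weights of the active axons turn negative (net inhibitory) after reward pairing and positive (net excitatory) after the reversal, while the silent axon is left untouched.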

A novel task to test for synaptic sign-switching in the brain

Testing this hypothesis is difficult. Unlike in ANNs, for which one can directly retrieve layer weights by calling model.weight.detach(), investigating the link between synaptic plasticity and behavior is challenging because one cannot directly measure synaptic weights in living animals (in vivo). Therefore, it is necessary to isolate and study the same synapses that underwent plasticity ex vivo.

In classical operant learning tasks, a sensory cue (e.g., auditory or olfactory) is paired with a positive or negative outcome, such as water for a thirsty animal or a mildly aversive air puff to the eye. However, these natural cues activate heterogeneous neuronal populations that cannot be easily identified later, preventing targeted analysis of synaptic plasticity at specific synapses.

We addressed these challenges by developing a new task that essentially replaces the reward- or punishment-associated sensory cue with activation of EP co-releasing neurons. To do this, we expressed ChrimsonR, an opsin that activates neurons in response to light, specifically in the EP co-releasing population. Because the same synapses can be optogenetically activated both in vivo and ex vivo, we can reliably induce synaptic plasticity in vivo and then re-activate these projections in acute brain slices to measure synaptic sign ex vivo, allowing us to both induce plasticity and measure synaptic states in a cell-type- and synapse-specific manner. Furthermore, because the LHb has a fixed inhibitory effect on DA release in the nucleus accumbens (NAc), we were able to use the modulation of optogenetically evoked DA release in NAc (measured using the dopamine sensor dLight) as a surrogate for the experience-dependent modulation of EP Sst→LHb synapses in vivo.

**Figure 2: Overview of the task.** Animals learned the association between activation of EP co-releasing Sst neurons and an outcome (water reward or aversive air puff). Specifically, during opto-reward pairing, the animal learns to lick upon the cue to obtain a large water reward; during opto-punishment pairing, it learns to withhold licking upon the cue to avoid a large air puff.

Mice picked up the task very quickly. In the video below, you can see the same animal before and after reward and punishment training: at first it doesn’t lick when the optogenetic stimulus is delivered, but by the end of reward pairing it responds with strong licking. After punishment pairing, this response flips and the animal learns to withhold licking upon stimulation.

EP Sst glutamate and GABA co-releasing neurons bi-directionally modulate DA

We first tested whether activation of EP Sst neurons is sufficient to modulate DA release in the baseline state before any pairing with specific outcomes. NAc DA reliably increases and decreases in response to delivery of water and air puffs, consistent with the stimuli being intrinsically associated with reward and punishment, respectively (right). However, EP opto stim during baseline sessions does not modulate NAc DA significantly (right), despite triggering substantial glutamate release in LHb. This suggests that EP-LHb synapses are, as a population, neutral at baseline with respect to their effects on LHb activity such that the glutamatergic and GABAergic signaling largely cancels out.

**Figure 3: EP>LHb synapses are neutral at baseline.** Before any opto-outcome pairing, EP opto stim does not modulate dopamine (red trace in the rightmost panel), suggesting that excitatory and inhibitory transmission through EP>LHb synapses are roughly equal across the population and cancel each other out.

We next paired EP opto stim with water delivery and then with air puffs across different sessions. During reward pairing, EP opto stim gradually gains the ability to induce positive DA transients in NAc as well as anticipatory licking during the cue presentation. Thus, pairing optogenetic activation of EP Sst inputs to LHb with rewards changes the output of the EP-LHb-VTA circuit, resulting in NAc DA signals that are consistent with the formation of cue-outcome associations by positive reinforcement.

Following reward pairing, we reversed the opto-outcome contingency by pairing the EP opto stim and opto+tone trials with air puffs. Because this change was not explicitly indicated to the animal, at the beginning of the session EP opto stim still induced positive DA transients and anticipatory licking during the cue period. However, repeated pairing decreased and then eventually eliminated both the positive-going DA transients in response to EP opto stim and the anticipatory licking during the cue period.

These data indicate that the initially neutral EP co-releasing neurons gain the ability to positively modulate DA release in NAc through pairing of EP opto stim with reward. This effect can also be rapidly reversed by changing environmental conditions, in our case by pairing with a negative reinforcer.

**Figure 4: Paired EP>LHb stim now modulates DA.** After opto-outcome pairing, stimulating EP co-releasing Sst neurons can now increase DA and anticipatory licking if EP opto stim is paired with reward, and reverses that increase if later EP opto stim is paired with punishment.

Sign switching occurs at EP Sst→LHb synapses during learning

To directly assess the sign of EP Sst→LHb synapses and their potential to switch upon different opto-outcome pairings, we performed ex vivo electrophysiology recordings after in vivo opto-pairing sessions. We exploited the optogenetic cue used to activate EP co-releasing neurons in vivo to allow us to target the same population in ex vivo brain slices. The optogenetic cue thus acts as a bridge, permitting further examination ex vivo of the same synapses that underwent plasticity in vivo.

**Figure 5: Use of optogenetic cue as a bridge.** Our new behavioral paradigm, which incorporates optogenetics, allows us to train synapses *in vivo* and examine these synapses *ex vivo*, essentially allowing us to run a version of model.weight.detach() at a population level across synapses on individual postsynaptic neurons.

We performed ex vivo whole-cell electrophysiological recordings at various stages after the opto-pairing task. Specifically, we stimulated terminals of EP co-releasing neurons in LHb and recorded excitatory and inhibitory postsynaptic currents (EPSCs and IPSCs) in voltage-clamped LHb neurons. To quantify and summarize the relative contribution of excitatory and inhibitory currents, we calculated a “synaptic sign index” for each cell, which measures the difference over the sum of the amplitudes of IPSCs and EPSCs. Synaptic sign indices of -1, 0, and 1 indicate pure excitation, balanced input (i.e., the amplitude of the EPSC and IPSC are equal), and pure inhibition, respectively.
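
As a worked example, the index can be written as a small Python function. The ordering of terms (IPSC minus EPSC) is inferred from the stated endpoints, where -1 is pure excitation and 1 is pure inhibition. The function name and the use of absolute peak amplitudes in arbitrary units are assumptions for illustration, not the authors' analysis code.

```python
def synaptic_sign_index(epsc_amplitude, ipsc_amplitude):
    """Difference over sum of IPSC and EPSC amplitudes.

    -1 means pure excitation, 0 means balanced, +1 means pure inhibition.
    Amplitudes are treated as magnitudes, so the recording sign convention
    (inward vs. outward current) does not matter here.
    """
    e = abs(epsc_amplitude)
    i = abs(ipsc_amplitude)
    if e + i == 0:
        raise ValueError("no evoked current: the index is undefined")
    return (i - e) / (i + e)

# Hypothetical cells (amplitudes in arbitrary units):
print(synaptic_sign_index(100.0, 0.0))   # → -1.0 (pure excitation)
print(synaptic_sign_index(50.0, 50.0))   # → 0.0 (balanced)
print(synaptic_sign_index(0.0, 80.0))    # → 1.0 (pure inhibition)
```

Normalizing by the sum keeps the index between -1 and 1 regardless of how large the evoked currents are, so cells with very different total input can be compared on the same scale.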

Surprisingly, when we grouped neurons by whether they came from animals that underwent reward or punishment pairing, no clear differences emerged in synaptic sign, as reflected in the synaptic sign index, across baseline, reward pairing, and punishment pairing groups. However, when we restricted the analysis to data from cells recorded on days on which the opto-outcome pairing was reversed (i.e., switched from punishment to reward or vice versa), differences in synaptic sign index across groups became evident.

On reversal days, postsynaptic currents in LHb cells evoked by EP co-releasing neurons shifted toward more inhibitory transmission after transition from punishment to reward, and toward more excitatory transmission with transition from reward to punishment. Thus, synaptic sign index reveals changes in excitatory vs inhibitory synaptic transmission in a population of EP-LHb synapses induced by in vivo opto-outcome reversal learning, consistent with reversal of the valence associated with the EP opto stim triggering synaptic plasticity that causes synaptic sign switching at these synapses.

**Figure 6: Synaptic sign switching during reversal learning.** EP>LHb synapses are more inhibitory when animals underwent punish-to-reward pairing reversal (green), and more excitatory when animals underwent reward-to-punish pairing reversal (purple).

Synaptic sign of EP Sst→LHb synapses correlates with recent dopamine updates

Since we observed pairing-induced changes in synaptic sign index on the day of opto-outcome reversal but not across all sessions, we hypothesized that variability in learning within individual sessions, as reflected in experience-dependent changes in DA signaling, may explain this discrepancy. DA dynamics fluctuate on a trial-by-trial basis, potentially due to uncontrolled sensory stimuli or internal state changes such as stress or thirst satiation, which in turn may affect the perceived value of stimuli.

To examine potential links between EP Sst→LHb synaptic sign and DA, we calculated summary metrics for the synaptic sign index and the changes in EP opto stim-evoked DA transients across trials for each animal before ex vivo recordings. We found that the sign of EP-LHb synapses is tightly linked to the direction of DA updates in the final minutes of behavior (~20 minutes or ~40 trials), supporting the notion that EP Sst→LHb synaptic plasticity correlates with recent learning-associated DA updates. Furthermore, separating all recorded cells based on this per-animal in vivo criterion reveals differences in synaptic sign across groups that grouping by reward/punishment pairing had failed to show.

**Figure 7: Synaptic sign correlates with recent DA updates.** We correlated the average synaptic sign index for each animal (“animal sign index”) with the slope of DA updates across trials. As shown in the bottom right panel, the animal sign index correlates only with DA updates calculated from the final ~20 trials (dark green), and not with DA updates from the session start (light green) or the session-averaged DA amplitude (gray).
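
The kind of analysis described here can be sketched in a few lines of Python. This is a simplified illustration rather than the pipeline used in the preprint: it assumes the "DA update" is the least-squares slope of evoked DA amplitude over the last `window` trials and that the relationship is summarized with a Pearson correlation. The helper names, the window size, and the three animals' data are all made up.

```python
# Sketch: relate each animal's sign index to its recent DA trend.
# Data, names, and window size are hypothetical.

def slope(values):
    """Least-squares slope of values against trial number (0, 1, 2, ...)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def recent_da_update(da_per_trial, window):
    """Slope of evoked DA amplitude across the final `window` trials."""
    return slope(da_per_trial[-window:])

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up example: evoked DA per trial and mean sign index per animal.
animals = {
    "animal_A": {"da": [0.0, 0.1, 0.0, 0.2, 0.4, 0.6], "sign_index": 0.5},
    "animal_B": {"da": [0.5, 0.6, 0.5, 0.3, 0.1, -0.1], "sign_index": -0.4},
    "animal_C": {"da": [0.2, 0.1, 0.2, 0.2, 0.25, 0.3], "sign_index": 0.1},
}

window = 3  # placeholder; the real analysis window is set by the experiment
da_slopes = [recent_da_update(a["da"], window) for a in animals.values()]
sign_indices = [a["sign_index"] for a in animals.values()]
print("correlation:", pearson(da_slopes, sign_indices))
```

Varying `window` (for example, taking early trials instead of the last ones) is the natural way to ask whether the relationship is specific to recent learning.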

Together, these results demonstrate that we can shift the sign of EP Sst→LHb synapses by making the activity of these synapses predictive of specific outcomes through pairing with positive and negative reinforcers: synapses become more inhibitory when DA release is increasing and more excitatory when DA release is being suppressed. The correlation between EP Sst→LHb synaptic sign and DA updates is only seen within a short time window (~20 minutes), suggesting that sign-switching plasticity at these synapses is a unique experience-dependent plasticity that operates at fast behavioral timescales to mediate online DA updates and support ongoing learning.

Speculation time: why implement or not implement sign switching plasticity?

The idea that synapses can switch signs challenges a long-held assumption about mammalian brains. Traditional models of activity-dependent plasticity of synapses posit that learning modulates synaptic strength but not synaptic sign. Our study reveals a previously undescribed form of synaptic plasticity in the mammalian brain: rapid experience-dependent synaptic sign switching. We show that pairing optogenetic activation of EP Sst neurons, which co-package and co-release glutamate and GABA, with positive or negative outcomes alters the relative contributions of glutamatergic and GABAergic currents evoked by the synapses they form in LHb. Furthermore, the same manipulation bi-directionally modulates the relationship between activity of EP Sst→LHb synapses and DA release in NAc, supporting the physiological importance of these mechanisms in vivo.

Despite our demonstration of the existence of synaptic sign switching in LHb, the segregation of excitatory and inhibitory transmission and the fixed sign of synaptic output likely hold in most of the brain. It is interesting to speculate why other synapses in the brain do not implement such plasticity and why sign switching is beneficial at EP Sst→LHb synapses. In ANNs, the strength of each synapse is typically set using gradient descent to minimize a cost function, which is roughly equivalent to minimizing the error in the calculation carried out by the circuit. Gradient descent in multi-layer networks, such as the brain, requires that each synapse be updated based on a rule that requires knowing how the activity of both the pre- and post-synaptic neuron relates to the error, which is difficult to implement in biological circuits. To circumvent this limitation, synapses in the brain may be constructed along positive and negative channels, each with a fixed relationship to a locally calculated error.

On the other hand, the computational responsibilities of the LHb and its fixed inhibitory relationship with NAc DA release may both necessitate and permit fast sign-switching plasticity that implements gradient descent. The EP Sst→LHb→NAc DA circuit operates essentially as a simple three-layer network in which the plasticity of inputs (EP) onto a hidden layer (LHb) controls the output (NAc DA). In such an architecture, in which the sign between the hidden layer and the output is fixed (negative in this case), the local activity of each EP–LHb neuron pair can be directly tied to the overall error, making a global error signal sufficient to implement gradient descent.

Animals adapt to changing environments through flexible mapping of actions to outcomes. Yet, this mapping is context dependent and plastic such that the same action may be predictive of a positive outcome in one context but negative outcome in the other, requiring animals to dynamically estimate the value of the expected outcome of carrying out an action. The basal ganglia funnels information from ~10 million corticostriatal neurons into ~1000 EP neurons, representing an enormous convergence of cortical signals. Thus, the activity of inputs to EP is enriched with information about external and internal state that is likely relevant to action selection. EP Sst neurons relay this information to LHb for calculation of expected value and modulation of VTA DA neurons that signal RPE. Therefore, the sign-switching property of EP Sst→LHb synapses and the update rules we revealed here provide a natural mechanism that ensures value updates can be rapidly and faithfully calculated with no potential computational ceiling.

This blog is adapted from Synaptic sign switching mediates online dopamine updates by Shun Li, Wengang Wang, Grace Knipe, Elliot Jerng, Paolo Capelli, Catherine Zhou, Eliana Bilsel, Bernardo Sabatini
