How to Optimize the Training of Large Language Models? Use Resets, Says Kempner Institute Investigator Kianté Brantley

By Yohan J. John, Ph.D. | August 7, 2025

Brantley studies how resetting LLMs can make reinforcement learning more efficient

Kianté Brantley (left) and his team deploy rigorous theory to discover more efficient RL algorithms, which can be used in a variety of machine learning (ML) applications.

Photo credit: Anthony Tulliani

Large language models (LLMs) are a cornerstone of the current revolution in AI, providing users with the unprecedented ability to generate useful text for a variety of practical applications.

In order for LLMs to do their work, researchers must first “train” the models using vast troves of data, a process that can be expensive and time-consuming. Kianté Brantley, a Kempner Institute investigator and assistant professor of computer science at SEAS, is at the forefront of efforts to make this process more efficient.

Brantley’s work focuses on reinforcement learning (RL), a family of training procedures that involve using rewards to nudge a model towards desired behaviors. RL-based techniques are often used to fine-tune LLMs to be more accurate and helpful, but their range of applications is far wider. [1]

For example, RL is used in a variety of machine learning (ML) applications and has proven very successful in teaching computers to play board games like backgammon, chess, and Go, and video games like Space Invaders and StarCraft. Such games are often treated as test cases, enabling RL algorithms to be carefully studied before they are deployed in more complex applications such as robotics, medicine, and LLM fine-tuning.

Brantley and his team deploy rigorous theory to discover more efficient RL algorithms. These algorithms can be used in multiple RL applications, including the fine-tuning of LLMs, which is an especially demanding application in terms of computational requirements. One of the most promising of these algorithms is called “resetting.”  

Using Montezuma’s Revenge, a classic video game, to understand resetting

When an ML model makes an error during the RL training process, the resetting algorithm causes the model to go back to where the error was made and try again. Through resetting, the model ultimately learns to make better and better decisions. The process of resetting can be illustrated by thinking about how an ML model learns to play a video game.

Brantley uses the example of the classic 1980s video game called Montezuma’s Revenge. In this game, a player controls an on-screen character whose aim is to collect jewels by overcoming obstacles, finding keys to open doors, and contending with enemies. “It’s a very hard exploration problem,” says Brantley. “You might have to climb down a ladder, jump over a skeleton, retrieve a key and so on. There are so many ways in which you can die, even if you get arbitrarily close to the key. But if you die your reward is always zero.”

Above, an example of the game Montezuma’s Revenge. An exploration and learning strategy based on random actions has worked well for many RL applications, including games like Pong and Space Invaders, but turns out to be far less effective in Montezuma’s Revenge.
Source: Gymnasium

Typically, an RL-based model (usually called an “RL agent”) learns to play Montezuma’s Revenge by randomly trying out actions and seeing what their consequences are. Trying random actions is how an RL agent “explores.” If an action leads to a good outcome, like getting jewels or a key, the agent gets a reward in the form of points, which it then uses to adjust its parameters. This in turn makes the rewarded action more likely the next time the agent plays the game.
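
The basic loop can be sketched in a few lines of code. The snippet below is illustrative only: it uses the Gymnasium toolkit (the source of the screenshot above), assumes the Atari environments (ale-py) are installed and registered, and simply plays one episode with random actions rather than performing any real learning.

```python
import gymnasium as gym

# Illustrative sketch: assumes Gymnasium's Atari environments (ale-py) are
# installed; depending on the version you may also need to import and
# register ale_py before making the environment.
env = gym.make("ALE/MontezumaRevenge-v5")

observation, info = env.reset()
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()  # "explore" by picking a random action
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # points for jewels, keys, and so on
    done = terminated or truncated

# A real RL agent would now adjust its parameters so that rewarded actions
# become more likely the next time it plays; here we just report the outcome.
print(f"Episode finished with a total reward of {total_reward}")
env.close()
```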

This exploration and learning strategy has worked well for many RL applications, including games like Pong and Space Invaders, but turns out to be far less effective in Montezuma’s Revenge. This is because in Montezuma’s Revenge, rewards only come at the end of a long chain of actions, most of which are not individually rewarded. This means that the agent receives rewards infrequently in comparison to the number of actions it takes. “Rewards are really sparse, making it a very hard problem,” explains Brantley.

In situations with sparse rewards, there are fewer opportunities for learning. “Imagine if you got really close to a key but then died,” says Brantley. “If you had taken a different action, you would have gotten a very high reward, right? But the process to get back to that state is very hard because you were exploring.” Moreover, RL agents explore using random actions, so getting back to that state might take a long time.

If the typical RL approach isn’t effective when rewards are sparse, how can researchers boost the ability of an agent to learn? Some inspiration comes from how humans learn from their past behavior. Human beings are deeply aware of missed opportunities, often using them to learn to do better when similar circumstances arise again. How can an RL agent do something similar, learning from the road not taken?

This is where resetting comes in. For hard exploration problems like Montezuma’s Revenge, Brantley and his team are using resets to send RL agents back to key decision points to try out the road not taken. They are probing the idea that if the agent is repeatedly returned to the location that it was in just before it made a mistake, it can save time and resources that would otherwise have been wasted in less productive forms of exploration.
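
A rough sketch of the idea (not the team’s exact algorithm): if the environment lets us save and restore its state, the agent can jump straight back to a promising decision point and try again from there. The `save_state` and `restore_state` helpers below are hypothetical stand-ins for whatever checkpointing a particular environment supports.

```python
def explore_with_resets(env, save_state, restore_state, n_attempts=10):
    """Hedged sketch of reset-based exploration. `save_state` and
    `restore_state` are hypothetical helpers for checkpointing the
    environment's internal state."""
    env.reset()
    checkpoint = save_state(env)        # remember the state just before the hard part
    best_return = float("-inf")

    for _ in range(n_attempts):
        restore_state(env, checkpoint)  # "reset": jump back without replaying the level
        episode_return, done = 0.0, False
        while not done:
            action = env.action_space.sample()  # try the road not taken
            _, reward, terminated, truncated, _ = env.step(action)
            episode_return += reward
            done = terminated or truncated
        best_return = max(best_return, episode_return)

    return best_return
```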

Using resetting in LLM fine-tuning

Brantley and his team focus on resetting in the context of LLMs because fine-tuning an LLM involves a similar challenge to teaching an RL agent to play Montezuma’s Revenge. Specifically, the LLM experiences rewards infrequently during fine-tuning.

Fine-tuning an LLM using RL involves trying out different responses to see which ones are most helpful and accurate. The best responses are rewarded with points, just like in a video game.

In introductions to LLMs, the pre-training process is normally described as next word prediction. But the model is more accurately described as performing next token prediction. A token is a chunk of meaningful text that could be a short word or a snippet smaller than a word. For example, a word like “cooked” might be broken up into two tokens: “cook” and “ed.” This allows the LLM to learn the grammatical role of “ed” and generalize it to other contexts.
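
A toy example of how such a split might work (the vocabulary and the greedy longest-match rule here are made up for illustration; real tokenizers, such as byte-pair encoding, learn their subword pieces from data):

```python
# Toy illustration of subword tokenization; the vocabulary is invented.
vocab = ["cook", "ed", "ing", "play", "jump"]

def tokenize(word, vocab):
    """Greedily split a word into the longest known subword pieces."""
    tokens, rest = [], word
    while rest:
        match = max((v for v in vocab if rest.startswith(v)), key=len, default=None)
        if match is None:          # unknown character: emit it as its own token
            match = rest[0]
        tokens.append(match)
        rest = rest[len(match):]
    return tokens

print(tokenize("cooked", vocab))   # ['cook', 'ed']
print(tokenize("playing", vocab))  # ['play', 'ing']
```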

Specifically, an LLM responds to a user prompt with a series of tokens, which are words or smaller chunks of text, such as “ing” or “ed.” [See sidebar Tokens vs Words.] The LLM generates these one at a time until it reaches a “stop” token (which tells it to stop generating tokens). If the final result is useful or relevant, the LLM gets a reward. So all the tokens leading up to the final one are analogous to the actions leading up to a key or a jewel in Montezuma’s Revenge.
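
In code, the analogy looks roughly like this. The `model.sample_next` method and the `reward_fn` below are hypothetical stand-ins for an LLM and a reward model; the point is only that the reward arrives once, after the whole sequence of token “actions” is complete.

```python
STOP = "<stop>"  # hypothetical stop token

def generate(model, prompt, max_tokens=256):
    """Generate a response one token at a time until a stop token appears."""
    tokens = []
    while len(tokens) < max_tokens:
        next_token = model.sample_next(prompt, tokens)  # hypothetical API
        if next_token == STOP:
            break
        tokens.append(next_token)
    return tokens

def score_response(model, reward_fn, prompt):
    tokens = generate(model, prompt)
    # Like reaching the key in Montezuma's Revenge, the reward only arrives
    # after the full sequence of token "actions" has been produced.
    return tokens, reward_fn(prompt, tokens)
```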

Brantley explains how resetting works during LLM fine-tuning. “We can make the LLM go back to a partial sentence that it previously generated and make it produce alternative completions,” he says. “Just like in Montezuma’s Revenge, if we can reset our LLM back to really good states, then it has a better chance of achieving a high reward.” And the more rewards an LLM gets, the more rapidly it converges on accurate and useful responses to queries.
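
Continuing the hypothetical sketch above, resetting an LLM might look like re-generating several alternative completions from the same promising prefix and recording their rewards (again, an illustration of the general idea rather than the team’s exact procedure):

```python
STOP = "<stop>"  # hypothetical stop token, as in the earlier sketch

def resample_from_prefix(model, reward_fn, prompt, prefix, k=4, max_tokens=256):
    """Reset the model to a partial response (`prefix`) and sample k
    alternative completions, keeping each completion's reward."""
    results = []
    for _ in range(k):
        tokens = list(prefix)                   # jump back to the good partial sentence
        while len(tokens) < max_tokens:
            next_token = model.sample_next(prompt, tokens)  # hypothetical API
            if next_token == STOP:
                break
            tokens.append(next_token)
        results.append((tokens, reward_fn(prompt, tokens)))  # reward each alternative
    return results
```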

An efficient way to compute expected returns from future actions

Brantley and his collaborators have shown that resetting is much more than just a way to save on exploration time. One of their key findings is that resetting can reduce the computational expenditure of RL algorithms.

Most modern RL algorithms involve computing something called an “expected return” from each possible starting point. The expected return is the amount of reward the model would receive from that point onward, averaged over the different sequences of actions available to it.
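
One simple way to picture this: if you could sample several continuations from a given starting point, the expected return is (approximately) the average of their total rewards. The `rollout_from` helper below is hypothetical.

```python
def estimate_expected_return(rollout_from, state, n_samples=8):
    """Monte Carlo estimate of the expected return: average the total reward
    of several sampled continuations from `state`. `rollout_from` is a
    hypothetical helper that plays out one continuation and returns its
    total reward."""
    returns = [rollout_from(state) for _ in range(n_samples)]
    return sum(returns) / len(returns)
```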

Traditional RL algorithms require a dedicated value model trained to predict the expected returns for each starting point. A value model consumes a lot of computational resources, since it is a large model and it must compute the expected returns for a vast number of possible actions.

Brantley and his collaborators have shown that with resets, an RL algorithm can learn effectively without using a value model, thus saving on time and computing power. They are also using mathematical theory to show that this computational cost-cutting does not come at the expense of accuracy. In fact, they show that the resetting approach closely matches the results obtained when a value model is used.
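
A hedged sketch of how this can work in spirit (an assumption about the general recipe, not the team’s published algorithm): with resets, the expected return from a state can be estimated by resetting to that state several times and averaging the observed rewards, and that average can then play the role that a value model’s prediction would normally play.

```python
def advantage_without_value_model(rollout_from_reset, state, observed_return, n_samples=8):
    """Hedged sketch: replace a learned value model with a reset-based estimate.
    `rollout_from_reset` is a hypothetical helper that resets the environment
    (or LLM) to `state`, plays out one continuation, and returns its reward."""
    baseline = sum(rollout_from_reset(state) for _ in range(n_samples)) / n_samples
    # How much better (or worse) this particular attempt did than average:
    return observed_return - baseline
```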

“In the past, reset was just good for exploration,” says Brantley. “But now we’re showing that it’s good for helping algorithms become very compute efficient.”

To learn more about Brantley’s research, visit his homepage.

About the Kempner

The Kempner Institute seeks to understand the basis of intelligence in natural and artificial systems by recruiting and training future generations of researchers to study intelligence from biological, cognitive, engineering, and computational perspectives. Its bold premise is that the fields of natural and artificial intelligence are intimately interconnected; the next generation of artificial intelligence (AI) will require the same principles that our brains use for fast, flexible natural reasoning, and understanding how our brains compute and reason can be elucidated by theories developed for AI. Join the Kempner mailing list to learn more, and to receive updates and news.

  1. For more information on how RL is used to train LLMs, see our recent Kempner Byte on the subject: From Lab Rats to Chatbots: On the Pivotal Role of Reinforcement Learning in Modern Large Language Models.