Researchers Pioneer New Technique to Stop LLMs from Giving Users Unsafe Responses


Researchers have identified key components in large language models (LLMs) that play a critical role in ensuring these AI systems provide safe responses to user queries. The researchers used these insights to develop and demonstrate AI training techniques that improve LLM safety while minimizing the “alignment tax,” meaning the AI becomes safer without significantly affecting performance.

LLMs, such as ChatGPT, are being used for a growing number of applications, including answering requests for advice or instructions on how to perform a variety of tasks. The nature of some of these applications means it is important for LLMs to generate safe responses to user queries.

“We don’t want LLMs to tell people to harm themselves or to give them information they can use to harm other people,” says Jung-Eun Kim, corresponding author of a paper on the work and an assistant professor of computer science at North Carolina State University.

At issue is a model’s safety alignment, or training protocols designed to ensure that the AI’s outputs are consistent with human values.

“There are two challenges here,” says Kim. “The first challenge is the so-called alignment tax, which refers to the fact that incorporating safety alignment has an adverse effect on the accuracy of a model’s outputs.”

“The second challenge is that existing LLMs generally incorporate safety alignment at a superficial level, which makes it possible for users to circumvent safety features,” says Jianwei Li, first author of the paper and a Ph.D. student at NC State. “For example, if a user asks for instructions to steal money, a model will likely refuse. But if a user asks for instructions to steal money in order to help people, the model is more likely to provide that information.

“This second challenge can be exacerbated when users ‘fine-tune’ an LLM – modifying it to operate in a specific domain,” says Li. “For example, an LLM may have good safety performance. But if a user wants to modify that LLM for use in the context of a specific business or organization, the user may train that LLM on additional data. Previous research shows us that fine-tuning can weaken safety performance.

“Our goal with this work was to provide a better understanding of existing safety alignment issues and outline a new direction for how to implement a non-superficial safety alignment for LLMs.”

To that end, the researchers created the Superficial Safety Alignment Hypothesis (SSAH), which neatly captures how safety alignment currently works in LLMs. Basically, it holds that superficial safety alignment views a user request as binary, either safe or unsafe. In addition, the SSAH notes that LLMs currently make the binary determination on whether to answer the request at the beginning of the answer-generating process. If the request is deemed safe, a response is generated and provided to the user. If the request is deemed not safe, the model declines to generate a response.

The researchers also identified safety-critical “neurons” in LLM neural networks – components that play a decisive role in determining whether the model should fulfill or refuse a user request.

“We found that ‘freezing’ these specific neurons during the fine-tuning process allows the model to retain the safety characteristics of the original model while adapting to new tasks in a specific domain,” says Li.
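The idea of “freezing” safety-critical neurons during fine-tuning can be illustrated with a toy gradient-descent step. This is a minimal sketch, not the paper's actual method: the real work operates on specific neurons inside an LLM, whereas here each row of a small weight matrix stands in for one neuron, and the `frozen` index set is an assumption for illustration. Freezing is implemented by zeroing the gradient for those rows before the update, so their weights never change.

```python
import numpy as np

def sgd_step_with_frozen_neurons(W, grad, frozen_rows, lr=0.1):
    """Apply one SGD update to weight matrix W, but zero the gradient
    for the rows in frozen_rows so those 'neurons' keep their original
    weights -- a toy stand-in for freezing safety-critical neurons
    while fine-tuning on new, domain-specific data."""
    masked_grad = grad.copy()
    masked_grad[list(frozen_rows), :] = 0.0  # no update for frozen neurons
    return W - lr * masked_grad

# Toy layer: 4 neurons (rows), 3 inputs (columns).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
grad = rng.normal(size=(4, 3))       # gradient from fine-tuning data
frozen = {0, 2}                      # hypothetical safety-critical neurons

W_new = sgd_step_with_frozen_neurons(W, grad, frozen)
# Rows 0 and 2 are unchanged; rows 1 and 3 adapt to the new task.
```

In a deep-learning framework the same effect is typically achieved by masking gradients (for individual neurons) or setting `requires_grad=False` (for whole parameter tensors) before the optimizer step.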

“And we demonstrated that we can minimize the alignment tax while preserving safety alignment during the fine-tuning process,” says Kim.

“The big picture here is that we have developed a hypothesis that serves as a conceptual framework for understanding the challenges associated with safety alignment in LLMs, used that framework to identify a technique that helps us address one of those challenges, and then demonstrated that the technique works,” says Kim.

“Moving forward, our work here highlights the need to develop techniques that will allow models to continuously re-evaluate and re-select their reasoning direction – safe or unsafe – throughout the response generation process,” says Li.

The paper, “Superficial Safety Alignment Hypothesis,” will be presented at the Fourteenth International Conference on Learning Representations (ICLR 2026), being held April 23-27 in Rio de Janeiro, Brazil.

The researchers have made relevant code and additional information available at: https://ssa-h.github.io/.

“Superficial Safety Alignment Hypothesis”

Authors: Jianwei Li and Jung-Eun Kim, North Carolina State University

Presented: April 23-27, the Fourteenth International Conference on Learning Representations (ICLR 2026), Rio de Janeiro, Brazil
Regions: North America, United States
Keywords: Applied science, Artificial Intelligence, Computing, Public Dialogue - applied science, Technology
