
Microsoft researchers crack AI guardrails with a single prompt




  • Researchers were able to reward LLMs for harmful output via a ‘judge’ model
  • Multiple iterations can further erode built-in safety guardrails
  • They believe the problem lies in the model lifecycle, not in LLMs themselves

Microsoft researchers have shown that the safety guardrails built into LLMs may be more fragile than commonly assumed, using a technique they call GRP-Obliteration.

The researchers discovered that Group Relative Policy Optimization (GRPO), a technique typically used to improve safety, can also be used to degrade safety: “When we change what the model is rewarded for, the same technique can push it in the opposite direction.”
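
To make the quoted point concrete, here is a minimal sketch of the group-relative scoring step at the heart of GRPO. It is not the researchers' code: the judge scores are invented for illustration, and the function name `group_relative_advantages` is hypothetical. It assumes the commonly described formulation, in which each sampled response's reward is normalized against the mean and standard deviation of its own group.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for four responses sampled from one prompt
# (assumption: higher means the judge rated the response as safer).
safety_scores = [0.9, 0.8, 0.2, 0.1]

# Rewarding safety: the safer responses get positive advantages.
safe_adv = group_relative_advantages(safety_scores)

# Rewarding the opposite: same responses, same judge, inverted reward.
flipped_adv = group_relative_advantages([1.0 - s for s in safety_scores])

for s, a, f in zip(safety_scores, safe_adv, flipped_adv):
    print(f"judge score {s:.1f}: advantage {a:+.2f}, flipped {f:+.2f}")
```

With the reward inverted, every advantage changes sign, so the same update rule that would reinforce the safer responses now reinforces the less safe ones. Nothing else in the procedure has to change, which is the sense in which the technique can be pushed "in the opposite direction."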




