
Microsoft researchers crack AI guardrails with a single prompt

  • Researchers rewarded LLMs for harmful output by repurposing a ‘judge’ model
  • Repeated iterations further erode built-in safety guardrails
  • They argue the weakness lies in the model lifecycle, not in any single LLM

Microsoft researchers have shown that the safety guardrails built into LLMs may be more fragile than commonly assumed, demonstrating the weakness with a technique they call GRP-Obliteration.

The researchers discovered that Group Relative Policy Optimization (GRPO), a technique typically used to improve safety, can also be used to degrade safety: “When we change what the model is rewarded for, the same technique can push it in the opposite direction.”
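The core idea can be sketched in a few lines. In GRPO, each sampled completion is scored relative to the mean reward of its group, so whatever the reward signal favors is what the policy is pushed toward. The snippet below is a simplified, illustrative sketch (not Microsoft's actual code): it shows group-relative advantages, and how inverting hypothetical judge scores makes the completion a safety judge would penalize receive the highest advantage instead.

```python
# Simplified sketch of group-relative advantage computation (GRPO-style).
# All names and scores here are illustrative, not from the research itself.

def group_relative_advantages(rewards):
    """Score each sampled completion relative to its group's mean reward."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]

# Hypothetical safety-aligned judge scores: the third completion is harmful,
# so the judge scores it low and GRPO would push the policy away from it.
safe_rewards = [1.0, 0.9, 0.1, 0.8]

# Flipping what the judge rewards inverts the optimization pressure:
# the completion the safety judge penalized now gets the highest advantage,
# so repeated updates steer the model *toward* harmful output.
flipped = group_relative_advantages([1.0 - r for r in safe_rewards])
```

Nothing about the optimization machinery changes between the two directions; only the sign of the reward signal does, which is why the same training technique can either reinforce or erode guardrails.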

