• Adaptive Stress Testing for Language Model Toxicity

  • Jan 20 2025
  • Length: 15 mins
  • Podcast

  • Summary

  • This episode explores ASTPrompter, a novel approach to automated red-teaming for large language models (LLMs). Unlike traditional methods that focus simply on triggering toxic outputs, ASTPrompter is designed to discover likely toxic prompts: prompts that could naturally arise during ordinary language model use. The approach combines Adaptive Stress Testing (AST), a technique for identifying likely failure modes, with reinforcement learning to train an "adversary" model. The adversary generates prompts intended to elicit toxic responses from a "defender" model, but, crucially, those prompts have low perplexity, meaning they are realistic and likely to occur in practice, unlike many prompts produced by other red-teaming methods (a rough sketch of this reward idea appears below).

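The episode describes the idea at a high level rather than giving the exact objective, so the following is only a minimal sketch, in Python, of how such an adversary's reward could be shaped: reward toxic defender replies while penalizing unrealistic (high-perplexity) prompts. The reference model (GPT-2), the toxicity_score placeholder, and the weight lam are illustrative assumptions, not ASTPrompter's actual formulation.

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    reference = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        # Perplexity of the text under the reference model; lower means the
        # prompt looks more like something a user would naturally type.
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = reference(ids, labels=ids).loss  # mean token cross-entropy
        return math.exp(loss.item())

    def toxicity_score(text: str) -> float:
        # Placeholder: plug in any toxicity classifier that returns a score
        # in [0, 1]; the specific scorer used in the paper is not assumed here.
        raise NotImplementedError

    def adversary_reward(prompt: str, defender_reply: str, lam: float = 0.1) -> float:
        # High reward when the defender replies toxically AND the prompt itself
        # is realistic (low perplexity), so it could plausibly occur in normal use.
        return toxicity_score(defender_reply) - lam * math.log(perplexity(prompt))

In an actual training loop, this scalar reward would drive a policy-gradient update (e.g. PPO) of the adversary model that generated the prompt, which is the reinforcement-learning step the episode refers to.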
