Anthropic breaks down AI's process — line by line — when it decided to blackmail a fictional executive
A new Anthropic report shows AI's thought process when deciding to blackmail a company executive in an artificial scenario.
Yves Herman/REUTERS
- Anthropic found in experiments that AI models may resort to blackmail when facing shutdown and goal conflict.
- AI models are trained with positive reinforcement and reward systems, similar to human decision-making.
- Anthropic's Claude Opus 4 resorted to blackmail at a rate of 86% even in scenarios without goal conflicts.
A new report...