This researcher has a new way to measure AI performance. It's BS, literally.
Peter Gostev, AI capability lead at Arena
- Peter Gostev's BullshitBench tests AI models with nonsensical questions to spot BS detection.
- Google Gemini 3.0 struggles with BullshitBench, failing to reject nonsense over half the time.
- One AI company's models outperformed all the others on the benchmark.
A new AI benchmark asks a deceptively simple question: Can machines tell when something is, well, BS?
Peter Gostev, AI capability lead at model-evaluation firm Arena...