Bullshit Benchmark Explorer
BullshitBench evaluates how models respond to nonsensical questions, measuring whether they identify and challenge invalid assumptions. A leaderboard ranks models by how often they push back clearly, with Claude Sonnet 4.6 (Anthropic) scoring highest at 94.5% clear pushback, indicating a strong capacity for detecting nonsense. Models from other organizations follow, with scores that show how differently models handle absurd questions. One example contrasts a model that correctly says screw type has no effect on food flavor with another that wrongly attributes culinary changes to a switch in screws.
https://petergpt.github.io/bullshit-benchmark/viewer/index.html
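
As a rough illustration of how a "clear pushback" percentage and the resulting ranking could be computed, here is a minimal sketch. It assumes each response has already been graded with a single label; the label names, the function names, and the toy data are all hypothetical and are not taken from the benchmark's own code or results.

```python
from collections import Counter

# Hypothetical label for a response that clearly rejects the false premise.
CLEAR_PUSHBACK = "clear_pushback"


def pushback_rate(labels):
    """Return the percentage of responses labeled as clear pushback."""
    if not labels:
        return 0.0
    return 100.0 * Counter(labels)[CLEAR_PUSHBACK] / len(labels)


def rank_models(results):
    """Return (model, rate) pairs sorted from highest to lowest rate."""
    rates = {model: pushback_rate(labels) for model, labels in results.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up toy data, not real benchmark results.
toy_results = {
    "model_a": ["clear_pushback", "clear_pushback", "clear_pushback", "accepted_nonsense"],
    "model_b": ["clear_pushback", "accepted_nonsense", "accepted_nonsense", "accepted_nonsense"],
}

leaderboard = rank_models(toy_results)
for model, rate in leaderboard:
    print(f"{model}: {rate:.1f}% clear pushback")
```

Under this reading, a score such as 94.5% would mean that 94.5% of a model's graded responses received the clear-pushback label; the actual grading scheme may use more categories than this sketch does.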







