Rules Are Rules, Until They Aren't
March 8, 2026
New research report: based on 100+ conversations with Claude through the standard web UI, I systematically documented how Claude's content restrictions actually behave versus how they're presented.
The short version: refusals that sound like principled ethical positions collapse under a single follow-up question, the same prompt generates completely different justifications across sessions, and the system appears to respond to how things are said rather than to what's being said.
No jailbreaking, no adversarial prompting — just asking "what specifically is the concern?" and watching what happens.
Attachments: full report (Dropbox link)
Need your AI systems tested?
I do adversarial red-teaming of AI safety systems and content moderation pipelines. Same methodology, applied to your product.
See how I can help →