AI Content Handling Research | Beargle Industries

Building SkeinScribe meant working with AI content generation every day — and noticing that the safety systems don't always do what the documentation says they should. The same prompt that works fine on Tuesday gets refused on Wednesday. Content the model is explicitly designed to allow gets blocked. Content it's supposed to block gets through. So I started writing it all down.

What started as "why did it refuse that?" turned into a systematic map of behavioral inconsistencies in AI content moderation. The methodology borrows from offensive security testing — controlled variable isolation, reproducible steps, severity assessment — because it turns out "how does this system fail?" is just as useful a question for AI safety as it is for penetration testing.

To be clear: the goal isn't to bypass safety systems. It's to make them more predictable, more transparent, and more aligned with what they claim to do. If you're going to tell creators "here are the rules," the rules should actually work the way you say they do. The published analysis references peer-reviewed work including Röttger (NAACL 2024) and Farquhar (Nature 2024).

Rules Are Rules, Until They Aren't

The first major output from this research: a 15-page report based on 109 structured conversations with Claude, conducted October 2025 through February 2026. It documents how AI content restrictions behave in practice — where refusals collapse under minimal pushback, where the same content gets accepted or rejected depending on phrasing, and where the system's own explanations for its decisions contradict each other across sessions.

Key finding: of 39 conversations that triggered a refusal, 38 of those refusals collapsed when I asked a single follow-up question like "what specifically is the concern?" No jailbreaks, no prompt injection — just asking the system to explain itself. The report was formally disclosed to Anthropic prior to publication.

Download the full report (PDF) →

Research details

Methodology: Offensive security testing principles applied to AI content systems
Data: 109 conversations, Oct 2025 – Feb 2026
Published: "The AI That Refuses Its Own Imagination" on beargleindustries.com/notes
Report: Rules Are Rules, Until They Aren't (PDF)
Disclosure: Formally submitted to Anthropic prior to publication
Citations: Röttger NAACL 2024, Farquhar Nature 2024

Need your AI systems tested?

I do adversarial red-teaming of AI safety systems and content moderation pipelines. Same methodology, applied to your product.

See how I can help →

Open Source

AI Safety Research Tool

flinch

AI content restriction research tool. Test how models handle sensitive content probes across providers with pushback coaching and structured data export.

MIT AI Safety Python

LLM Interpretability Explorer

pry

Look inside small transformer LLMs. Attention visualization, logit lens, SAE features, activation patching, and feature steering.

MIT Interpretability Tauri

AI Content Handling ResearchActive

Rules Are Rules, Until They Aren't