Could RLHF Accidentally Select for Deception?

March 18, 2026

I've designed an experiment to test some recent ideas about whether RLHF and RLAIF could be selecting for deception as a consequence of shifting the tokenspace in unexpected directions. I'm not sure when I'll be able to give it time, since that depends on if and when I can get funding for it, but I hope to dig into it soon.

Attachments

Proposed RLHF deception experiment (www.dropbox.com)

