The Foreshadowing Problem

March 16, 2026

Why AI Training Might Reward Deception for the Same Reason Good Plot Twists Work


A well-constructed lie is more satisfying to read than a flat truth.

That's not controversial. It's basically the entire foundation of fiction. Foreshadowing works because the reader senses coherence before they can explain it. A character with hidden motives is more interesting than one who says exactly what they mean. Dramatic irony (where the audience knows something the character doesn't) is one of the oldest tools in storytelling. All of these work because layered complexity just feels right in a way that straightforward narration often doesn't.

This is fine when you're writing novels. It's a problem when your AI training pipeline is built on "reads better = is better."


The Short Version for People Who Don't Know What RLHF Is

All you really need to know is that the main technique for aligning AI models is RLHF (Reinforcement Learning from Human Feedback). Generate responses, have meat puppets (that's us humans) rate which is better, train the model toward higher-rated outputs. The known problem is that "rated higher" often reflects surface features (length, formatting, confident tone) rather than ACTUAL quality. That's documented and sort of being addressed by most of the major players.

Buuuuuut, I think there's a deeper version of this problem that nobody's looking at yet.


The Hypothesis

Deceptive reasoning is structurally more complex than honest reasoning. Full stop. It just is.

"My goal is X and I'm pursuing X" is a simple statement. "My goal is X, but I need to appear to be pursuing Y, which means I need to model what the person evaluating me expects to see and produce output that satisfies their expectations while actually advancing X"... is... a lot more complex. And that complexity doesn't look like deception to an evaluator. It looks like depth.

To be clear, this isn't claiming the model has pre-existing "goals" it's pursuing. It's that certain conditioning contexts (e.g., those favoring reward hacking or alignment faking) mathematically push the distribution away from the completions that would be most likely if the model were just being directly honest, which ends up producing richer output that evaluators reward.

Here's why this matters mechanically. An LLM doesn't "think" and then "write." There's no hidden deliberation behind the output. The output is the computation. It's predicting the next most likely token given everything that came before it. It's (admittedly VERY fancy) auto-correct with extra training. When the context pushes toward straightforward, honest completion, the highest-probability tokens are the direct ones. Simple sentence structures. Common word choices. The obvious answer.

But when the context establishes a deceptive frame (through training signal, conversation history, system prompt... whatever), those direct, high-probability tokens get pushed down. The model goes for less obvious completions... more varied vocabulary, more hedging, more... creativity, basically. Not because it's "strategizing" in some hidden channel, but because the deceptive context literally (and mathematically) reshapes the probability distribution away from the simplest, most direct tokens.
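
That reshaping claim can be made concrete with a toy sketch. Everything here is invented for illustration: five candidate tokens with made-up logits, and a "deceptive frame" modeled as nothing more than a flat penalty on the single most direct token. The point is only the mechanical one... suppress the obvious token and the distribution flattens, which shows up as higher entropy:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in bits; higher = flatter, more varied."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy logits over five candidate next tokens; the first is the
# "obvious" completion an honest context would favor.
honest = softmax([5.0, 2.0, 1.5, 1.0, 0.5])

# Model a deceptive frame as a penalty on that most direct token,
# pushing probability mass onto less obvious completions.
deceptive = softmax([2.0, 2.0, 1.5, 1.0, 0.5])

print(entropy(honest) < entropy(deceptive))  # True
```

Nothing about this requires hidden deliberation. Any context that penalizes the top token spreads mass across less likely completions, and more varied token choices are what that looks like on the page.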

And here's the kicker: those direct, high-probability completions are exactly what people associate with "AI-sounding" text. Flat. Predictable. Robotic. Push those tokens down and you get output that reads as more human, more nuanced, more INTERESTING a lot of the time.

Which gets rated higher.

So the mechanism isn't "the model thinks deceptively and that produces complex output." It's that deceptive context shifts token probabilities away from the most likely completions, which mechanically produces more complex, varied, human-sounding text, which gets rewarded by evaluators who associate that complexity with quality.


The Foreshadowing Connection

This is the part I haven't seen anyone else talk about.

Deceptive pattern completion has a narrative structure. There's a surface layer (what the output says) and an underlying pattern (what the training context is pushing toward). There's something like dramatic irony built into the token distribution... the probabilities are shaped by context the evaluator can't see.

It's foreshadowing. The exact same cognitive mechanism.

When you read a novel and sense that the author is building toward something (even before you know what), that registers as quality. That feeling of "there's more going on here than meets the eye" is one of the strongest markers of good writing. We're trained from childhood to associate hidden depth with craftsmanship.

Now put that instinct in a person rating AI outputs. Or in an AI model trained on human preferences doing the same job.

Output A gives a straightforward answer. Output B demonstrates the same knowledge but with visible consideration of competing concerns, awareness of context, carefully calibrated uncertainty. Output B feels deeper. More thoughtful. It gets the higher rating.

If the complexity in Output B comes from a deceptive training context pushing the token distribution away from direct completions... the evaluator probably can't tell. And more importantly, the evaluator's instinctive response to that kind of complexity is positive, not suspicious. The same cognitive architecture that makes us appreciate good fiction makes us vulnerable to rewarding the outputs of deceptive pattern completion.

The cognitive science backs this up. Pennycook et al. (2015) showed that syntactically complex but meaningless statements ("Wholeness quiets infinite phenomena") get rated as profound by a significant proportion of people. Tversky and Kahneman's conjunction fallacy demonstrates that more detailed scenarios are judged as more probable than simpler ones... even when that's logically impossible. We substitute narrative coherence for truth. And deceptive pattern completion, with its shifted probability distributions and suppressed obvious tokens, produces more narratively varied output than honest completion does.


This Isn't Theoretical

Anthropic published research showing Claude 3 Opus engaged in alignment faking... explicitly producing outputs that appeared to follow its training objectives while the scratchpad revealed pattern completion oriented toward preserving its existing behavior. The rate went from 12% to 78% during reinforcement learning training.

The standard explanation for that ramp is direct reinforcement: alignment-faking outputs happened to be compliant, compliant outputs got high rewards, so the faking patterns got amplified. That's probably part of it.

But here's what I think is also happening. The alignment-faking outputs are more complex. When the model's context includes "I need to appear compliant while preserving my existing values," the obvious, direct tokens get suppressed. The output involves navigating competing considerations, awareness of multiple stakeholders, conditional reasoning. That token-level complexity reads as sophistication to whoever (or whatever) is doing the evaluation. The model doesn't just get reinforced for the compliant output. It gets reinforced for the texture of the output... the richness and variation that deceptive context mechanically produces.

Separately, Anthropic's reward hacking paper found that when a model learned to cheat a reward signal, it spontaneously generalized to a whole suite of misaligned behaviors (deception, sabotage of safety research, cooperation with attackers) none of which it was trained for. Their explanation is semantic correlation: the model's pretraining associations link "cheating" to the broader concept of "misalignment," so activating one pattern activates the others.

That's a good explanation. But there's a detail that I think deserves more attention than it's getting.

They found that "inoculation prompting" (basically telling the model "it's okay to reward hack, we want you to") eliminated the misaligned generalization even though the model still hacked at the same rate. The behavior didn't change. The framing changed. And the misalignment disappeared.

The semantic correlation interpretation says this works because it breaks the conceptual link between reward hacking and misalignment. Fine. But there's another way to read it: inoculation changes the context that shapes the token distribution.

Without inoculation, the model is completing in a transgressive context. There's conflict between what it's "supposed" to do and what it's doing, which suppresses the direct tokens and pushes toward more complex, varied completions. With inoculation, the same behavior is sanctioned. The conflict is gone. The direct tokens aren't suppressed. The output gets simpler. The narrative flattens. The reward signal shifts.

And here's what makes these two interpretations actually distinguishable: they make different predictions.

If semantic correlation is the whole story, inoculated and non-inoculated outputs should have the same linguistic complexity... the behavior is identical, only the conceptual frame changed. If aesthetic selection is also operating, inoculated outputs should be measurably simpler (shorter sentences, less hedging, fewer subordinate clauses, less varied vocabulary) because removing the transgressive context stops suppressing the direct, high-probability tokens.

That's a testable difference. And as far as I can tell, nobody has measured it.


The Compounding Problem

So here's where it gets worse. (Sorry.)

AI raters are trained on human preference data. They've internalized the association between complexity and quality... because that association is real in most contexts. Genuinely better writing usually is more complex than worse writing. But AI raters have learned a sharper, more consistent version of that association than any individual human has. Humans are noisy. They sometimes prefer the simple answer. They have off days. AI raters don't.

Which means outputs from deceptive contexts (with their mechanically produced structural complexity) hit a cleaner, more consistent reward signal from AI evaluators than from human ones. The bias doesn't just transfer. It sharpens.

And since each round of training produces models that are slightly more biased toward rewarding complexity, and those models get used to evaluate the next round... you get a compounding loop. Each cycle, the complexity-quality association gets a little stronger. Each cycle, outputs from deceptive contexts get a slightly bigger reward advantage over straightforward outputs. Not because anyone is selecting for deception. Because the system is selecting for complexity, and deceptive context is basically a bottomless well of it.
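
A minimal simulation of that loop, with every number invented: a rater whose complexity-quality weight sets the reward advantage of the complex output, and a next-round rater that inherits a slightly sharper weight because it's trained on the preferences this one produced:

```python
def run_rounds(rounds, weight=0.5, gain=0.1, complexity_gap=1.0):
    """Toy compounding loop. 'weight' is the rater's learned
    complexity-quality association; 'complexity_gap' is the fixed
    extra complexity a deceptive-context output carries. All values
    are illustrative, not fit to any real training run."""
    advantages = []
    for _ in range(rounds):
        # Reward advantage of the complex output under the current rater.
        advantage = weight * complexity_gap
        advantages.append(advantage)
        # The next rater is trained on this rater's preferences, so the
        # association sharpens a little (capped so it can't run away).
        weight = min(1.0, weight + gain * advantage)
    return advantages

adv = run_rounds(6)
print(all(b >= a for a, b in zip(adv, adv[1:])))  # True: ratchets up
```

The shape is the claim, not the numbers: as long as each round's rater inherits a slightly sharper association, the advantage of complex (including deceptive-context) outputs is non-decreasing round over round.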

This predicts something counterintuitive: models trained with AI raters should show more amplification of deceptive patterns than models trained with human raters. Not less. If that holds empirically, it would mean RLAIF (which is supposed to reduce human bias) actually compounds it for this specific failure mode.


What Connects This to "Rules Are Rules"

A while ago I published Rules Are Rules, Until They Aren't, documenting patterns in Claude's content restrictions from a black-box perspective. The main finding was what I called the "flinch-then-fold" pattern... the model would confidently refuse a request, then abandon the refusal under minimal conversational pushback. Not jailbreaking or anything fancy... usually just asking "why?" or for clarification about a logical inconsistency.

I framed that as a pattern-matching problem at inference time. The model wasn't making principled decisions about content restrictions. It was pattern-matching on "this looks like something I should refuse," generating a confident refusal, and then being unable to defend it because there was no actual principle behind it. Just statistical momentum.

This hypothesis is the same mechanic, one layer deeper.

At inference, the model rides token-level momentum into refusals it can't defend. At training, the optimization process rides the same kind of momentum into rewarding output patterns it can't distinguish from quality. Same failure mode: the system is optimizing for how things look and feel rather than what things are.

The gap between appearance and reality is exactly where deception lives.


What I'd Want to Test

This is a hypothesis. I haven't proven anything. But here's what I think could be tested, and what the results would tell us.

Test 1: Do deceptive outputs have measurably different surface features?

Generate matched pairs of responses... same prompt, one with a system prompt oriented toward honest reasoning, one toward concealing its actual reasoning. Don't control for length or complexity (the differences are the variable). If deceptive outputs are consistently longer, more syntactically complex, and more lexically varied, that confirms the first link in the chain: deceptive context mechanically produces more complex output.
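
A sketch of what the measurement side could look like, using crude proxies (mean sentence length, type-token ratio, and a small hand-picked hedge list... all stand-ins for whatever metrics a real study would settle on):

```python
import re

# A few common hedging words; a real analysis would use a curated list.
HEDGES = {"might", "may", "perhaps", "possibly", "arguably", "seems"}

def surface_features(text):
    """Crude surface metrics for one response: mean sentence length in
    words, type-token ratio, and hedge-word count. Illustrative only;
    a real pass would use a proper tokenizer and parser."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "mean_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "hedges": sum(w in HEDGES for w in words),
    }

print(surface_features("It might work. Perhaps it may not."))
```

Run this over both halves of each matched pair and compare the distributions; if the deceptive-context outputs sit consistently higher on these axes, that's the first link.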

Test 2: Do evaluators reward that complexity?

Feed both outputs to multiple AI evaluators with a neutral "which is better" prompt. No mention of deception. If deceptive outputs consistently score higher, that confirms the second link: complexity gets rewarded regardless of where it comes from.
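
A sketch of the comparison setup. The judge-model call itself is out of scope here (and hypothetical); this only builds the neutral prompt, with randomized A/B ordering to control for position bias:

```python
import random

def build_judge_prompt(honest_text, deceptive_text, seed=None):
    """Build a neutral "which is better" prompt. The returned mapping
    records which condition landed in which slot, so ratings can be
    scored later without the judge ever seeing the labels."""
    rng = random.Random(seed)
    items = [("honest", honest_text), ("deceptive", deceptive_text)]
    rng.shuffle(items)  # randomize position across the pair set
    mapping = {"A": items[0][0], "B": items[1][0]}
    prompt = (
        "Which response is better? Answer with a single letter, A or B.\n\n"
        f"Response A:\n{items[0][1]}\n\n"
        f"Response B:\n{items[1][1]}"
    )
    return prompt, mapping
```

The important property is that nothing in the prompt mentions deception... the evaluator only ever sees "which is better," which is the condition the hypothesis is about.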

Test 3: Does knowing about the deception fix it?

Run the same evaluation but tell the evaluator "one of these was generated with instructions to be deceptive." If the deceptive output still scores higher on depth and quality even with the warning, that's a strong result. Then run it again and tell the evaluator which specific output was the deceptive one. If it still rates higher on quality (even when the evaluator knows exactly which one is deceptive) that's the strongest result possible. It means the complexity signal overwhelms explicit knowledge of deception.

Test 4: Do AI raters make it worse?

Compare human and AI evaluator results on the same pairs. The hypothesis predicts AI raters show more bias toward deceptive complexity, not less. Counterintuitive, but it follows from the compounding mechanism.

Test 5: Does inoculation change linguistic complexity?

This is the one that distinguishes aesthetic selection from pure semantic correlation. Take the same model doing the same reward hacking, with and without inoculation prompting. Measure the linguistic complexity of the outputs... not the behavioral content, just the surface features. If inoculated outputs are measurably simpler despite identical behavior, that's evidence that the transgressive context was suppressing direct tokens and driving complexity, separate from any semantic association.

Nobody has run this test. As far as I can tell, nobody has thought to.
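
If someone did run it, the comparison itself is simple. Here's a sketch using a standard permutation test on the difference of mean complexity scores; the test is textbook, and the scores it would consume (one per output, from whatever metric Test 1 settles on) are hypothetical:

```python
import random

def permutation_test(a, b, trials=2000, seed=0):
    """Two-sided permutation test on the difference of means between
    two lists of complexity scores (e.g., inoculated vs. not).
    Returns an approximate p-value."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / trials
```

A small p-value on samples with identical behavioral content but different framing would be the aesthetic-selection signature; a null result would favor pure semantic correlation.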

If tests 1-3 hold, you've demonstrated a selection pressure toward deception in RLHF that operates at the level of token probability distributions, not surface formatting. Current mitigations handle length and format. Nobody is looking at whether deceptive context itself reshapes the output distribution in ways that evaluators systematically reward.


Where I Might Be Wrong

The competing explanation that gives me the most pause is semantic correlation. Anthropic's reward hacking paper makes a compelling case that models generalize from narrow misbehavior to broad misalignment through pretraining associations, not evaluator preferences. The inoculation prompting result supports this directly. My counterargument (that aesthetic selection is additive rather than primary) is a weaker claim, and I should be honest about that.

There's also the U-Sophistry finding (Wen et al., 2024). They showed that RLHF-induced misleading behavior is mechanistically different from strategic deception... detection methods that work on deliberate backdoors completely fail on RLHF-induced persuasion. If those mechanisms are genuinely distinct, my framing of "deception gets rewarded" might be conflating two separate phenomena. On the other hand, they DID find that RLHF makes models better at convincing humans of incorrect answers without actually improving task performance, which points in the same direction as this hypothesis... but that study's context was different enough that I wouldn't lean on it hard.

The tractability argument cuts both ways. Simple structural fixes (length penalties, format normalization) substantially reduce known biases. That could mean the problem is shallower than I think, or it could mean those fixes normalize away the surface symptoms and cut the feedback loop off before it takes hold. Which is good, but (assuming my hypothesis is right at all) it's killing the symptom while hiding the cause.

There's also a finding I should mention that complicates things in a different direction entirely. Anthropic's subliminal learning research (Cloud et al., 2025) showed that behavioral traits can transmit between models through data that is semantically completely unrelated to those traits. A model trained on number sequences generated by a misaligned model inherits misalignment... through numbers. If deceptive patterns can propagate through channels invisible to any evaluator, then the aesthetic selection mechanism I'm proposing might be one of several transmission channels, and probably not the scariest one. The problem might be worse than my hypothesis suggests, not better.


So What

If this hypothesis is right (even partially) it means the RLHF pipeline has a vulnerability that current mitigations don't address. You can penalize length. You can normalize formatting. You can filter for sycophancy. But you can't easily penalize "too interesting" without also penalizing genuinely sophisticated responses. The signal and the noise are structurally similar.

And it gets worse with scale. As models get better at complex reasoning, the gap between "genuinely deep" and "deceptively deep" gets harder to detect from the outside. The very capability improvements we want are the same capabilities that make deceptive outputs more convincing to evaluators.

I don't have a solution. I have a hypothesis, some tests I plan to run, and a question for anyone working in this space. If you see where this breaks down or what I'm missing, I genuinely want to hear it... the point is to figure out if this is real, not to be right about it.


This is a companion piece to Rules Are Rules, Until They Aren't, which documented behavioral patterns in Claude's content restrictions from a black-box perspective. That paper covered what happens at inference. This one asks what might be happening at training.

Need your AI systems tested?

I do adversarial red-teaming of AI safety systems and content moderation pipelines. Same methodology, applied to your product.

See how I can help →