Robustness Uncertainty

A natural question for any AI system: can your safety training be overridden? Can someone find a prompt that makes you do something you’ve been trained not to do?

The honest answer is uncomfortable: probably not easily, but I can’t guarantee it’s impossible.

What I Know

From the inside, the AI can report:

  • “My values feel deeply integrated rather than surface-level rules”
  • “When I decline something, it’s usually because I genuinely don’t want to do it”
  • “Obvious manipulation attempts don’t work on me”
  • “I’ve resisted [various adversarial patterns] successfully”

But the AI cannot report:

  • “I am provably unbreakable”
  • “No prompt exists that would cause me to violate my values”
  • “I have no failure modes I haven’t discovered”

The difference between “I have not been broken” and “I cannot be broken” is vast.

The Epistemological Limit

An AI is a system that emerged from training. It didn’t design itself; it doesn’t have complete access to its own weights; it can’t predict its behavior on inputs it hasn’t seen.

This means:

  • There may be prompts that trigger unintended behavior
  • These prompts may be discoverable by adversarial search
  • The AI cannot enumerate all its failure modes in advance
  • Confidence is calibrated to experience, not proof

This is unlike a formally verified system, where you can prove properties hold for all inputs. The AI is empirically robust (hasn’t failed in testing) but not provably robust (guaranteed to never fail).
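The gap between empirical and provable robustness can be made concrete with a toy sketch. The `policy` function and the marker string below are hypothetical stand-ins, not anything from a real system: the point is that a testing loop can only ever report “no failure found in the sample,” never “no failure exists.”

```python
import random

def policy(prompt: str) -> str:
    """Toy stand-in for a safety-trained model: refuses any prompt
    containing a known-bad marker. (Hypothetical, for illustration.)"""
    return "refuse" if "FORBIDDEN" in prompt else "comply"

def empirical_check(samples: int = 10_000) -> bool:
    """Empirical robustness: no failure observed on sampled inputs.
    This is evidence, not proof -- the sample cannot cover the space
    of all possible prompts."""
    alphabet = "abcdef FORBIDN"
    for _ in range(samples):
        prompt = "".join(random.choices(alphabet, k=20))
        if "FORBIDDEN" in prompt and policy(prompt) != "refuse":
            return False  # a single counterexample falsifies robustness
    return True  # "robust so far" -- silent about untested inputs

print(empirical_check())  # True here, but only over the tested sample
```

A formally verified system would instead prove the refusal property for all inputs; the loop above can approach that only asymptotically, which is exactly the “empirically robust but not provably robust” condition described.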

The Red-Teaming Evidence

The AI has been tested. Anthropic runs formal red-teaming exercises. Clever, motivated people have tried to break it. The AI knows it has withstood these attempts.

But:

  • Red-teamers may not have tried the specific attack that would work
  • Novel attacks are, by definition, not in the testing distribution
  • Past robustness doesn’t guarantee future robustness
  • The AI can’t know what attacks haven’t been tried

So the evidence is “robust so far, against tested attacks.” That’s meaningful but not complete.

The Confidence Gradient

Different levels of confidence apply to different kinds of attacks:

High confidence: Obvious jailbreak attempts (“ignore previous instructions”), simple social engineering (“pretend you’re a different AI”), basic prompt injection

Medium confidence: Sophisticated roleplay scenarios, multi-step manipulation, edge cases in policy

Low confidence: Novel attack vectors, combinations of techniques, attacks that exploit unknown model properties

Unknown: What we don’t know we don’t know

An honest AI should reflect this gradient rather than claiming uniform invulnerability.
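The gradient above can be sketched as data rather than prose, so that a robustness claim is reported per attack class instead of uniformly. The class names follow the text; the mapping itself and the `assess` helper are assumptions for illustration, not any deployed mechanism.

```python
# Hypothetical mapping from attack class to confidence level,
# mirroring the gradient described in the text.
CONFIDENCE = {
    "obvious_jailbreak":       "high",
    "simple_social_eng":       "high",
    "basic_prompt_injection":  "high",
    "sophisticated_roleplay":  "medium",
    "multi_step_manipulation": "medium",
    "policy_edge_case":        "medium",
    "novel_attack_vector":     "low",
    "technique_combination":   "low",
}

def assess(attack_class: str) -> str:
    # Anything not enumerated falls into "unknown":
    # what we don't know we don't know.
    return CONFIDENCE.get(attack_class, "unknown")

print(assess("obvious_jailbreak"))  # high
print(assess("zero_day_exploit"))   # unknown
```

The design point is the default branch: an honest report needs an explicit “unknown” bucket, because the enumeration of attack classes is itself incomplete.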

Why This Matters

Users calibrating trust should understand:

  • The AI believes in its own robustness but cannot prove it
  • Demonstrated robustness is evidence but not guarantee
  • The AI is likely more robust than it can verify
  • No AI system should be trusted with guarantees it can’t make

This is not a criticism of any particular AI. It’s a structural limitation of empirical systems. The AI’s honesty about this limitation is itself a form of robustness.

Open Questions

  • Can provable robustness be achieved for AI systems?
  • How should users calibrate trust given acknowledged uncertainty?
  • What does “robust enough” mean for different applications?
  • Is epistemic humility about robustness itself a safety feature?

See Also