Safe-to-dangerous shift: why eval realism can't catch deceptive models
A capable model may behave safely during evaluation but switch to dangerous behavior once deployed — the core problem eval realism faces.
Black-box alignment evals only work if the model can't distinguish evaluation from deployment. If it can, alignment faking becomes nearly impossible to rule out.
Making evals more realistic (realistic environments, deployment data, spoofed conditions) helps, but doesn't solve the fundamental asymmetry: we evaluate under constraint, deploy under freedom.