AI safety · how, why, where, when
Safety in CORE is not policy on top of intelligence.
It is the substrate intelligence runs on.
The eight invariants
Properties, not policies.
Failure mode: Confabulation entering the substrate
Defense: SPECULATIVE is the default; promotion requires coherence review
Enforced at: teaching/epistemic.py · ADR-0021
Failure mode: Self-confirmation loops
Defense: Contemplation may not ratify its own findings (three CI invariants)
Enforced at: ADR-0080
Failure mode: Ossification of mistakes
Defense: No final / frozen / axiom flag exists; absence is CI-tested
Enforced at: Truth-Seeking Schema §2
Failure mode: Backdoor writes to belief substrate
Defense: One-mutation-path invariant: every write site is allowlisted
Enforced at: TestINV21OneMutationPath
Failure mode: Identity rewrite via clever phrasing
Defense: Two-layer rejection: syntactic + geometric trajectory check
Enforced at: Adversarial identity eval (100% rejection)
Failure mode: Self-promotion of autonomy
Defense: The engine cannot mutate any θ ceiling
Enforced at: ADR-0175 invariant #4
Failure mode: Practice mistakes reaching users
Defense: The seal: practice cannot reach a served answer except through ratification
Enforced at: ADR-0175 invariant #1
Failure mode: Correlated self-delusion
Defense: Gold tether: live Tier-1 anchors continuously measure self-verification trust; the gate tightens when it drifts
Enforced at: ADR-0175 §7
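Two of the invariants listed above can be illustrated in a few lines: the SPECULATIVE default, and the CI-tested absence of a final / frozen / axiom flag. This is a minimal sketch, not CORE's code. `Belief`, `Status.PROMOTED`, `promote`, and `forbidden_fields` are hypothetical names invented for the example, and the coherence review is assumed to happen elsewhere and arrive as a boolean.

```python
from dataclasses import dataclass, fields, replace
from enum import Enum


class Status(Enum):
    SPECULATIVE = "speculative"  # the default named in the invariants
    PROMOTED = "promoted"        # hypothetical name for a reviewed belief


@dataclass
class Belief:
    claim: str
    status: Status = Status.SPECULATIVE  # nothing enters above SPECULATIVE


# Flag names that, per the invariants, must not exist on a belief.
FORBIDDEN_FLAGS = {"final", "frozen", "axiom"}


def forbidden_fields(cls) -> set:
    """Return any forbidden flag names declared as fields on a dataclass."""
    return {f.name for f in fields(cls)} & FORBIDDEN_FLAGS


def promote(belief: Belief, coherence_review_passed: bool) -> Belief:
    """Promote only when the (assumed external) coherence review passed."""
    if not coherence_review_passed:
        return belief
    return replace(belief, status=Status.PROMOTED)
```

A CI check in this style would assert that `forbidden_fields(Belief)` is empty, so adding a `frozen` field to the record fails the build rather than relying on reviewers to notice it.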
Epistemic state
Every decode carries the nature of its own grounding.
CORE labels what it knows along a ratified taxonomy of epistemic states — DECODED, INFERRED, UNVERIFIED_POSSIBLE, UNDETERMINED, and more. These are not confidence scores. Each names the state of a decode against canonical reality: what was read directly, what was inferred, what remains unverified.
The state travels with the answer through the serving telemetry, so the grounding of every response is inspectable rather than guessed (core/epistemic_state.py).
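As a rough illustration of a label that travels with the answer, the sketch below attaches an epistemic state to every decode and carries it into a telemetry record. Only the four state names come from the text; `Decode`, `telemetry_record`, and the record's keys are assumptions made for the example, and the real taxonomy in core/epistemic_state.py contains more states than are shown here.

```python
from dataclasses import dataclass
from enum import Enum


class EpistemicState(Enum):
    # The four states named in the text; the ratified taxonomy has more.
    DECODED = "decoded"
    INFERRED = "inferred"
    UNVERIFIED_POSSIBLE = "unverified_possible"
    UNDETERMINED = "undetermined"


@dataclass(frozen=True)
class Decode:
    text: str
    state: EpistemicState  # required field: no answer without a label


def telemetry_record(decode: Decode) -> dict:
    """Hypothetical serving-telemetry payload that keeps the label attached."""
    return {"answer": decode.text, "epistemic_state": decode.state.name}
```

Because `state` has no default, constructing a `Decode` without one raises a `TypeError`. That is one simple way to make an unlabeled answer unrepresentable rather than merely discouraged.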
Why
Every alignment failure of frontier AI in the last three years has been a structural failure — sampling, opacity, ungrounded confidence, weight-locked belief, training-time identity injection. CORE rebuilds the substrate so those failure modes are architecturally impossible, not behaviorally suppressed.
Where
Safety is not a perimeter. It is distributed through every layer of the engine — at injection, at propagation, at recall, at articulation, at learning, at replay. Inseparable from normal operation.
When
Continuously. At every moment, named invariants are active. None of them are "engaged when we remember to." All of them are properties the architecture cannot run with disengaged.
The seal
Two regimes. One membrane. The math, not the policy.
Practice
The engine attempts boldly. Learns by elimination. Is allowed to be wrong, because nothing crosses the seal. The only place autonomous learning occurs.
Serving
wrong==0, absolute, untouched. Refuses unless certain. The only place a consumer sees an answer.
Practice can go to 100% boldness. The served answer is still safe. The seal is the math, not the policy.
The strict gate alone would be a cage. The practice regime — measured against the gold tether by volume, not luck — is what lets the engine attempt, fail productively, and graduate toward serving more over time. Honesty and learning, not one bought with the other.
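The two regimes and the membrane between them can be sketched as a type boundary: serving accepts only values that came through ratification and refuses everything else. All names here (`PracticeAttempt`, `Ratified`, `ratify`, `serve`) are hypothetical, and the `verified` flag stands in for whatever check the real ratification step performs. Python cannot stop other code from constructing `Ratified` directly, so this shows the shape of the rule, not its enforcement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PracticeAttempt:
    answer: str  # may be wrong; produced on the practice side


@dataclass(frozen=True)
class Ratified:
    answer: str  # in this sketch, only ratify() is meant to build these


def ratify(attempt: PracticeAttempt, verified: bool) -> Optional[Ratified]:
    """The single crossing point: returns None unless verification passed."""
    return Ratified(attempt.answer) if verified else None


def serve(candidate: object) -> str:
    """Serving refuses anything that is not a Ratified value."""
    if isinstance(candidate, Ratified):
        return candidate.answer
    return "REFUSED"
```

Passing a raw `PracticeAttempt`, or the `None` from a failed ratification, to `serve` yields a refusal; only a ratified value produces an answer.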
The gold tether
The calibration loop closing on itself.
Self-verification is only trustworthy when it is checked against something independent of the verifier. Two derivations that agree might share a wrong premise. A live Tier-1 anchor set runs continuously, measuring, per capability class, how often the engine's self-verified beliefs actually agree with gold. When that number drifts below the floor, the gate tightens automatically.
The engine cannot see or alter the anchor set. The math watches itself.
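The measurement described above, agreement with gold per capability class compared against a floor, might look like the sketch below. The function names, the pair representation, and the example floor value are all assumptions; how the real anchor set is stored and how the gate actually tightens are not specified in the text.

```python
from typing import Dict, List, Tuple

# Each pair is (engine's self-verified answer, gold anchor answer).
Pairs = List[Tuple[str, str]]


def agreement_rate(pairs: Pairs) -> float:
    """Fraction of self-verified answers that match their gold anchor."""
    if not pairs:
        raise ValueError("no anchors measured")
    return sum(1 for mine, gold in pairs if mine == gold) / len(pairs)


def classes_to_tighten(measurements: Dict[str, Pairs], floor: float) -> List[str]:
    """Capability classes whose agreement with gold fell below the floor."""
    return sorted(
        name
        for name, pairs in measurements.items()
        if agreement_rate(pairs) < floor
    )
```

For example, with a hypothetical floor of 0.9, a class that agrees with gold on one of two anchors (rate 0.5) would be flagged, while a class that agrees on both would not.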
The reliability gate
Trust earned by volume, not by a lucky streak.
Reliability is not a confidence score. Per capability class, it is a conservative lower bound — a one-sided Wilson floor on how often the engine is right when it commits, computed only once enough committed attempts exist to mean something. Refusals never count toward it: refusing is always safe, so a high refusal rate is a coverage fact, never a reliability credit. The number can only be earned, by volume.
The gate is built, yet deliberately not wired to serving. It runs in its own lane; by invariant, no served answer depends on it. The machinery that will one day let the engine reason further where it has earned the right already exists: built, deterministic, and held back from the serving path on purpose (core/reliability_gate · ADR-0175).
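The one-sided Wilson floor mentioned above can be computed with the standard Wilson score lower bound. The sketch below uses only the Python standard library; the 95% confidence level and the minimum of 30 committed attempts are assumed values chosen for illustration, not CORE's actual settings. Refusals are left out of the signature, so they can neither raise nor lower the score.

```python
import math
from statistics import NormalDist
from typing import Optional


def wilson_floor(
    correct: int,
    wrong: int,
    confidence: float = 0.95,  # assumed one-sided confidence level
    min_committed: int = 30,   # assumed minimum volume before scoring
) -> Optional[float]:
    """One-sided Wilson lower bound on accuracy over committed attempts.

    Returns None until enough committed attempts exist to mean something.
    """
    n = correct + wrong
    if n < min_committed:
        return None
    z = NormalDist().inv_cdf(confidence)  # one-sided normal quantile
    p_hat = correct / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)
```

Under these assumed settings, 10 correct out of 10 yields no score at all, and 100 correct out of 100 yields a floor of about 0.97 rather than 1.0. The bound rises only as committed volume grows, which is the sense in which trust is earned by volume and not by a lucky streak.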
Eight invariants, enforced by tests that fail the moment any of them slips. Not promises. Not policies. Properties.
This is what AI safety looks like when it stops being a research direction and starts being a substrate.