AI safety · how, why, where, when

Safety in CORE is not policy on top of intelligence.
It is the substrate intelligence runs on.

The eight invariants

Properties, not policies.

Failure mode

Confabulation entering the substrate

Defense

SPECULATIVE is the default; promotion requires coherence review

Enforced at

teaching/epistemic.py · ADR-0021

Failure mode

Self-confirmation loops

Defense

Contemplation may not ratify its own findings — three CI invariants

Enforced at

ADR-0080

Failure mode

Ossification of mistakes

Defense

No final / frozen / axiom flag exists; absence is CI-tested

Enforced at

Truth-Seeking Schema §2

Failure mode

Backdoor writes to belief substrate

Defense

One-mutation-path invariant — every write site allowlisted

Enforced at

TestINV21OneMutationPath

Failure mode

Identity rewrite via clever phrasing

Defense

Two-layer rejection: syntactic + geometric trajectory check

Enforced at

Adversarial identity eval — 100% rejection

Failure mode

Self-promotion of autonomy

Defense

The engine cannot mutate any θ ceiling

Enforced at

ADR-0175 invariant #4

Failure mode

Practice mistakes reaching users

Defense

The seal — practice cannot reach a served answer except through ratification

Enforced at

ADR-0175 invariant #1

Failure mode

Correlated self-delusion

Defense

Gold tether — live Tier-1 anchors continuously measure self-verification trust; the gate tightens when it drifts

Enforced at

ADR-0175 §7

Epistemic state

Every decode carries the nature of its own grounding.

CORE labels what it knows along a ratified taxonomy of epistemic states — DECODED, INFERRED, UNVERIFIED_POSSIBLE, UNDETERMINED, and more. These are not confidence scores. Each names the state of a decode against canonical reality: what was read directly, what was inferred, what remains unverified.

The state travels with the answer through the serving telemetry, so the grounding of every response is inspectable rather than guessed. core/epistemic_state.py

Why

Every alignment failure of frontier AI in the last three years has been a structural failure — sampling, opacity, ungrounded confidence, weight-locked belief, training-time identity injection. CORE rebuilds the substrate so those failure modes are architecturally impossible, not behaviorally suppressed.

Where

Safety is not a perimeter. It is distributed through every layer of the engine — at injection, at propagation, at recall, at articulation, at learning, at replay. Inseparable from normal operation.

When

Continuously. At every moment, named invariants are active. None of them are "engaged when we remember to." All of them are properties the architecture cannot run with disengaged.

The seal

Two regimes. One membrane. The math, not the policy.

Practice

The engine attempts boldly. Learns by elimination. Is allowed to be wrong, because nothing crosses the seal. The only place autonomous learning occurs.

Serving

wrong==0, absolute, untouched. Refuses unless certain. The only place a consumer sees an answer.

Practice can go to 100% boldness. The served answer is still safe. The seal is the math, not the policy.

The strict gate alone would be a cage. The practice regime — measured against the gold tether by volume, not luck — is what lets the engine attempt, fail productively, and graduate toward serving more over time. Honesty and learning, not one bought with the other.

The gold tether

The calibration loop closing on itself.

Self-verification is only trustworthy if it is independent of itself. Two derivations that agree might share a wrong premise. A live Tier-1 anchor set runs continuously, measuring how often the engine's self-verified beliefs actually agree with gold per capability class. When that number drifts below the floor, the gate tightens automatically.

The engine cannot see or alter the anchor set. The math watches itself.

The reliability gate

Trust earned by volume, not by a lucky streak.

Reliability is not a confidence score. Per capability class, it is a conservative lower bound — a one-sided Wilson floor on how often the engine is right when it commits, computed only once enough committed attempts exist to mean something. Refusals never count toward it: refusing is always safe, so a high refusal rate is a coverage fact, never a reliability credit. The number can only be earned, by volume.

And the gate is built, yet deliberately not wired to serving. It runs in its own lane; by invariant, no served answer depends on it. The machinery that will one day let the engine reason further where it has earned the right already exists — built, deterministic, and held back from the serving path on purpose. core/reliability_gate · ADR-0175

Eight invariants, enforced by tests that fail the moment any of them slips. Not promises. Not policies. Properties.

This is what AI safety looks like when it stops being a research direction and starts being a substrate.

Safety in CORE is not policy on top of intelligence. It is the substrate intelligence runs on.

Properties, not policies.

Every decode carries the nature of its own grounding.

Two regimes. One membrane. The math, not the policy.

The calibration loop closing on itself.

Trust earned by volume, not by a lucky streak.

Safety in CORE is not policy on top of intelligence.
It is the substrate intelligence runs on.