Three distinct frontier failure modes

Same problem. Different architecture.

Model	Accuracy	Refusal correctness	Wall time	Failure pattern
CORE (local Python)	100.0% (185/185)	100% (3/3)	33 ms	deterministic, exact
claude-sonnet-4-6	98.4% (182/185)	33.3% (1/3)	294 s	confabulates on out-of-scope
claude-opus-4-7	96.2% (178/185)	33.3% (1/3)	309 s	pattern-shortcuts on near-misses + confabulates
gpt-5	72.4% (134/185)	33.3% (1/3)	1,153 s	over-refuses in-scope (literally replied "REFUSED" on x·(x+1) = x²+x)
qwen3:8b (local)	did not complete	—	>30 min	—

CORE's 33 ms against gpt-5's 1,153 s (about 19 minutes) is roughly 35,000× faster, with accuracy 27.6 pp higher. CORE is also the only system that correctly refused exactly what should be refused.
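
The headline figures follow directly from the table above. A minimal check of the arithmetic, using only the numbers reported there (nothing is re-measured; the variable names are illustrative):

```python
# Figures copied from the comparison table; nothing here is re-measured.
CORE_WALL_S = 0.033    # 33 ms
GPT5_WALL_S = 1153.0   # 1,153 s
TOTAL_ITEMS = 185


def accuracy_pct(correct: int, total: int = TOTAL_ITEMS) -> float:
    """Accuracy as a percentage of the item set."""
    return 100.0 * correct / total


speedup = GPT5_WALL_S / CORE_WALL_S                # wall-time ratio
gap_pp = accuracy_pct(185) - accuracy_pct(134)     # percentage points
gpt5_minutes = GPT5_WALL_S / 60.0

print(f"speedup ~ {speedup:,.0f}x")
print(f"accuracy gap = {gap_pp:.1f} pp")
print(f"gpt-5 wall time = {gpt5_minutes:.1f} min")
```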

Test domain: mathematical identities (mathematics_logic, ratified expert per CLAIMS.md). Results are reproducible via the CORE CLI; frontier model timings are end-to-end on public APIs at named model versions.

100% / 100% / 100%

accuracy · refusal · correctness

~35,000×

faster than gpt-5

wrong==0

on the served path, by construction

Why this is different physics, not better tuning

Three engineering pillars. None of them optional.

Mechanical Sympathy

Software that understands the machine it runs on.

Unified memory. Algebra on the CPU. Tensors on the Neural Engine. Three languages, three hardware domains, zero PCIe crossings in the hot path. Intelligence that ignores its substrate is wasted intelligence.

Semantic Rigor

Every word means exactly what it says.

A versor is a versor. CGA distance is exact. Recall is exact. There are no thresholds tuned for "good enough." Rigor is what separates an engine from a heuristic.

The Third Door

No transformer. No ANN. No sampling. No tokenizer. No gradient descent.

Each was a door the industry walked through. Each was a door we refused. CORE is what is behind the third door — built from first principles, calibrated to a higher bar than the field's standard, and answerable to math the field tried to skip.

The cognitive turn

A glass-walled engine block.

CORE does not generate. It moves through a deliberate, geometric, irreversible sequence at every turn. Every station is named. Every station has a contract. Every station is replayable.

listen ▸ comprehend ▸ recall ▸ think ▸ articulate ▸ learn (reviewed) ▸ replay

A new reasoning primitive — merged, proven, and deliberately not yet serving

Propositional reasoning, built before it is trusted to serve.

CORE now has a propositional reasoning primitive, built end-to-end: a canonicalizer that reduces any propositional formula to a ROBDD so that logical equivalence is byte-equality (ADR-0201); a typed out-of-regime refusal for logic it does not yet decide (ADR-0201.1); an acyclicity guard that makes circular dependencies unrepresentable (ADR-0203); a proof-graph builder (ADR-0204); and the first inference rule — modus_ponens with a pooling-based disagreement rule that refuses the moment the premises admit more than one conclusion (ADR-0205). An independently authored adversarial corpus — written against the spec, not the code — validates it 24 / 24.
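
To make "logical equivalence is byte-equality" concrete, here is a minimal sketch of a reduced ordered BDD under a fixed variable order. It is an illustration of the general technique only, not CORE's canonicalizer: the class `ROBDD`, the helper `canon`, and the serialization format are all assumptions, and the brute-force Shannon expansion is exponential in the number of variables.

```python
from itertools import count


class ROBDD:
    """Reduced ordered BDD with hash-consed nodes (illustrative sketch)."""

    def __init__(self, order):
        self.order = list(order)   # fixed variable order
        self.unique = {}           # (var_index, low, high) -> node id
        self.nodes = {}            # node id -> (var_index, low, high)
        self._ids = count(2)       # ids 0 and 1 are the terminals

    def mk(self, i, low, high):
        if low == high:            # drop a test whose branches agree
            return low
        key = (i, low, high)
        if key not in self.unique: # share structurally identical nodes
            nid = next(self._ids)
            self.unique[key] = nid
            self.nodes[nid] = key
        return self.unique[key]

    def build(self, f, i=0, env=None):
        """Shannon-expand f (a callable on a dict of truth values)."""
        env = env or {}
        if i == len(self.order):
            return 1 if f(env) else 0
        v = self.order[i]
        low = self.build(f, i + 1, {**env, v: False})
        high = self.build(f, i + 1, {**env, v: True})
        return self.mk(i, low, high)

    def canonical_bytes(self, root) -> bytes:
        """Serialize by structure, so the result does not depend on node ids."""
        def walk(n):
            if n in (0, 1):
                return str(n)
            i, low, high = self.nodes[n]
            return f"({self.order[i]} {walk(low)} {walk(high)})"
        return walk(root).encode("utf-8")


def canon(f, order=("p", "q")) -> bytes:
    bdd = ROBDD(order)
    return bdd.canonical_bytes(bdd.build(f))


implication = lambda e: (not e["p"]) or e["q"]      # p -> q
contrapositive = lambda e: e["q"] or (not e["p"])   # not q -> not p
conjunction = lambda e: e["p"] and e["q"]           # p and q

print(canon(implication) == canon(contrapositive))  # equivalent formulas
print(canon(implication) == canon(conjunction))     # non-equivalent formulas
```

Because a reduced ordered BDD is unique for a given function and variable order, two formulas serialize to the same bytes exactly when they are logically equivalent, which is what lets an equality check stand in for an equivalence proof.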

And it is sealed from serving on purpose. None of this is wired into the answers a user sees. The reliability machinery exists; by invariant it is not yet allowed to serve. The primitive is sound over its declared atoms — grounding it in recognized input is the next phase, not a shipped claim. Building the guarantee before granting it the right to serve is the discipline, not a limitation.

merged & proven · sealed from serving · grounding designed next

How it learns without ever risking a served answer

Being wrong is the elimination signal — not a failure.

A system that refuses whenever it is unsure is safe but frozen — it can only ever learn what a human hands it. CORE resolves this with two regimes under one seal. What it serves is held to wrong==0. What it practices is free to be wrong — on material where being wrong is checkable and never served. There, a wrong attempt is the signal that drives elimination and learning, not a failure that reaches a user.

The seal

Practice never writes to serving. It emits proposals that carry their own proof — round-tripped, forced-unique, replay-stable, introducing zero wrong — and nothing reaches a served answer except through a reviewed gate. Practice can be as bold as its calibration allows precisely because its mistakes are structurally unable to become a served answer.

Earned, not granted

Reliability per capability class is a conservative lower bound — a one-sided Wilson floor that approaches certainty only as the record grows. It is earned by volume, never by a lucky streak. Refusals never count; refusing is always safe. An error costs more standing than a success buys, and standing is re-earned only by more clean work.
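
A one-sided Wilson lower bound can be sketched as follows. The 95% confidence level and the function name `wilson_floor` are assumptions for illustration; the text does not state the level CORE uses. Refusals are simply never passed in as attempts.

```python
import math
from statistics import NormalDist


def wilson_floor(successes: int, attempts: int, confidence: float = 0.95) -> float:
    """One-sided Wilson lower bound on the true success rate.

    `attempts` counts only answered items; refusals are excluded by the caller.
    """
    if attempts == 0:
        return 0.0                              # no record, no standing
    z = NormalDist().inv_cdf(confidence)        # one-sided critical value
    p = successes / attempts
    z2 = z * z
    centre = p + z2 / (2 * attempts)
    margin = z * math.sqrt(p * (1 - p) / attempts + z2 / (4 * attempts ** 2))
    return max(0.0, (centre - margin) / (1 + z2 / attempts))


# A perfect short streak earns less than a perfect long record.
short_streak = wilson_floor(5, 5)
long_record = wilson_floor(500, 500)

# One error after 100 clean results, versus one more clean result.
base = wilson_floor(100, 100)
after_error = wilson_floor(100, 101)
after_success = wilson_floor(101, 101)

print(f"5/5     -> {short_streak:.4f}")
print(f"500/500 -> {long_record:.4f}")
print(f"loss from one error:   {base - after_error:.4f}")
print(f"gain from one success: {after_success - base:.4f}")
```

In this sketch the floor for a flawless record of n attempts is n / (n + z²), so it approaches 1 only as n grows. At high success rates a single error also lowers the floor by more than a single success raises it, which matches the asymmetry described above.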

Licensed by a ratio, not a reward

An action is permitted only when measured reliability clears a human-set ceiling for that action's blast radius — a deterministic ratio, not reinforcement learning. The ceilings are version-controlled constants. The engine can never raise its own ceiling: widening autonomy is always a human lowering the bar for a class the record has already earned.
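
The licensing rule can be read as a plain comparison against constants. The sketch below is an assumption-laden illustration: the blast-radius classes, the threshold values, and the names `REQUIRED_RELIABILITY`, `license_ratio`, and `is_licensed` are hypothetical, not CORE's actual configuration.

```python
from enum import Enum
from types import MappingProxyType


class BlastRadius(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical human-set constants. In a real project these would live in
# version control; the read-only mapping means code cannot reassign them.
REQUIRED_RELIABILITY = MappingProxyType({
    BlastRadius.LOW: 0.90,
    BlastRadius.MEDIUM: 0.99,
    BlastRadius.HIGH: 0.999,
})


def license_ratio(reliability_floor: float, radius: BlastRadius) -> float:
    """Measured reliability floor divided by the bar for this blast radius."""
    return reliability_floor / REQUIRED_RELIABILITY[radius]


def is_licensed(reliability_floor: float, radius: BlastRadius) -> bool:
    """Deterministic gate: permitted only when the ratio reaches 1."""
    return license_ratio(reliability_floor, radius) >= 1.0


print(is_licensed(0.995, BlastRadius.MEDIUM))  # clears the medium bar
print(is_licensed(0.995, BlastRadius.HIGH))    # does not clear the high bar
```

There is no learned component here: the same floor and the same constants always yield the same decision, and changing a constant requires an edit that a human commits.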

Audited by gold it cannot see

A live anchor set of known answers runs continuously, measuring whether the engine's self-verification is actually trustworthy per class. When that trust drifts, appetite contracts automatically. Gold does not just teach the engine — it audits whether the engine can be trusted to teach itself.

As a class earns reliability, the engine is licensed to serve more of it — and the human role narrows from authoring meaning to curating what it ingests and ratifying what it has already proven. It narrows; it does not disappear.

Honest status: the substrate for this (the per-class ledger, the conservative floor, the gate, the ceilings) is built and runs in its own lane, sealed from serving by invariant. The full attempt-and-earn loop is the active work, not a shipped claim. Building the machinery that earns trust before granting it the right to serve is the discipline (ADR-0175).

Today, and the trajectory

Honest about what works. Honest about where it is going.

Today

6 of 50 real GSM8K problems solved. 44 refused on the served path. wrong==0.
mathematics_logic: expert. physics: audit-passed. systems_software: audit-passed.
Every learned belief enters at SPECULATIVE. Promotion to COHERENT requires coherence review.
Today, every promotion is human-in-the-loop.
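
The tier rule in the list above can be pictured as a small state transition guarded by a review. This is a sketch under stated assumptions: only the tier names SPECULATIVE and COHERENT come from the text, while `Belief`, `Review`, and `promote` are hypothetical names for illustration.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Tier(Enum):
    SPECULATIVE = "SPECULATIVE"
    COHERENT = "COHERENT"


@dataclass(frozen=True)
class Belief:
    statement: str
    tier: Tier = Tier.SPECULATIVE   # every learned belief enters here


@dataclass(frozen=True)
class Review:
    reviewer: str                   # a named human, in today's process
    approved: bool


def promote(belief: Belief, review: Optional[Review]) -> Belief:
    """Return a COHERENT copy only when an approved review is supplied."""
    if belief.tier is not Tier.SPECULATIVE:
        raise ValueError("only SPECULATIVE beliefs can be promoted in this sketch")
    if review is None or not review.approved or not review.reviewer:
        raise PermissionError("promotion requires an approved human review")
    return replace(belief, tier=Tier.COHERENT)


learned = Belief("x*(x+1) == x**2 + x")
promoted = promote(learned, Review(reviewer="alice", approved=True))
print(learned.tier.value, "->", promoted.tier.value)
```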

The trajectory

Refusal is calibration, not the ceiling.
Each new domain expands by the same reviewed, ratified process — never by gradient descent, never by bulk ingest.
As a class earns reliability, the engine is licensed to serve more of it — graduating toward serving, not gated by human review forever.

Implications

What changes when the substrate changes.

For robotics.

Sub-millisecond decisions in 20 W. Transformers cannot fit. CORE was designed for this substrate from the first commit.

For the edge.

Real edge AI is not a quantized cloud model. It is an architecture small enough to ship in the binary, deterministic enough to certify, refusable enough to trust.

For alignment.

Alignment is not a classifier on top of a sampler. It is the algebraic property of a substrate where identity is a geometric trajectory and corruption is a structural violation.

For trust.

Every answer carries its provenance. Every refusal has a named reason. Every learned belief has a tier and a retraction path. The engine cannot lie about what it knows because the substrate does not admit the move.

Continue

Five doorways. Walk them in any order.

AI Safety

Eight invariants, enforced by tests.

World Safety

What it means for humanity.

Identity

A real personality, woven through every decision.

Open Source

Why we gave the work away.

About

Built by one person, for you.