Three distinct frontier failure modes
Same problem. Different architecture.
| Model | Accuracy | Refusal correctness | Wall time | Failure pattern |
|---|---|---|---|---|
| CORE (local Python) | 100.0% (185/185) | 100% (3/3) | 33 ms | deterministic, exact |
| claude-sonnet-4-6 | 98.4% (182/185) | 33.3% (1/3) | 294 s | confabulates on out-of-scope |
| claude-opus-4-7 | 96.2% (178/185) | 33.3% (1/3) | 309 s | pattern-shortcuts on near-misses + confabulates |
| gpt-5 | 72.4% (134/185) | 33.3% (1/3) | 1,153 s | over-refuses in-scope (literally replied "REFUSED" on x·(x+1) = x²+x) |
| qwen3:8b (local) | did not complete | — | >30 min | — |
CORE's 33 ms against gpt-5's 1,153 s (about 19 minutes) is roughly 35,000× faster, with accuracy 27.6 pp higher, and CORE is the only system that correctly refused exactly what should be refused.
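The headline ratios follow directly from the table. A minimal check, with the figures copied from the CORE and gpt-5 rows (the variable and function names are illustrative, not part of CORE):

```python
# Figures copied from the comparison table; names are illustrative only.
CORE_WALL_S = 0.033        # 33 ms
GPT5_WALL_S = 1153.0       # 1,153 s
CORE_ACC = (185, 185)      # (correct, total)
GPT5_ACC = (134, 185)


def accuracy_pct(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total


speedup = GPT5_WALL_S / CORE_WALL_S
gap_pp = accuracy_pct(*CORE_ACC) - accuracy_pct(*GPT5_ACC)
gpt5_minutes = GPT5_WALL_S / 60

print(f"{speedup:,.0f}x faster, {gap_pp:.1f} pp, {gpt5_minutes:.1f} min")
# prints: 34,939x faster, 27.6 pp, 19.2 min
```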
Test domain: mathematical identities (mathematics_logic, ratified expert per CLAIMS.md). Results are reproducible via the CORE CLI; frontier-model timings are end-to-end on public APIs at the named model versions.
Why this is different physics, not better tuning
Three engineering pillars. None of them optional.
Mechanical Sympathy
Software that understands the machine it runs on.
Unified memory. Algebra on the CPU. Tensors on the Neural Engine. Three languages, three hardware domains, zero PCIe crossings in the hot path. Intelligence that ignores its substrate is wasted intelligence.
Semantic Rigor
Every word means exactly what it says.
A versor is a versor. CGA distance is exact. Recall is exact. There are no thresholds tuned for "good enough." Rigor is what separates an engine from a heuristic.
The Third Door
No transformer. No ANN. No sampling. No tokenizer. No gradient descent.
Each was a door the industry walked through. Each was a door we refused. CORE is what is behind the third door — built from first principles, calibrated to a higher bar than the field's standard, and answerable to math the field tried to skip.
The cognitive turn
A glass-walled engine block.
CORE does not generate. It moves through a deliberate, geometric, irreversible sequence at every turn. Every station is named. Every station has a contract. Every station is replayable.
A new reasoning primitive — merged, proven, and deliberately not yet serving
Propositional reasoning, built before it is trusted to serve.
CORE now has a propositional reasoning primitive, built end-to-end: a canonicalizer that reduces any propositional formula to a ROBDD so that logical equivalence is byte-equality (ADR-0201); a typed out-of-regime refusal for logic it does not yet decide (ADR-0201.1); an acyclicity guard that makes circular dependencies unrepresentable (ADR-0203); a proof-graph builder (ADR-0204); and the first inference rule — modus_ponens with a pooling-based disagreement rule that refuses the moment the premises admit more than one conclusion (ADR-0205). An independently authored adversarial corpus — written against the spec, not the code — validates it 24 / 24.
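To make the "equivalence is byte-equality" idea concrete, here is a toy reduced ordered BDD. It is a sketch under stated assumptions, not the ADR-0201 canonicalizer: the `ROBDD` class and its `mk`, `build`, and `serialize` methods are hypothetical names, formulas are passed as plain Python callables, and the builder uses exhaustive Shannon expansion, which is exponential in the number of variables and only suitable for illustration.

```python
from itertools import count


class ROBDD:
    """Toy reduced ordered BDD over a fixed variable order (hypothetical API)."""

    def __init__(self, var_order):
        self.vars = list(var_order)
        self.unique = {}        # (var, low, high) -> node id
        self.nodes = {}         # node id -> (var, low, high)
        self._ids = count(2)    # ids 0 and 1 are the terminal nodes

    def mk(self, var, low, high):
        if low == high:                 # reduction: drop a test whose branches agree
            return low
        key = (var, low, high)
        if key not in self.unique:      # reduction: share structurally equal nodes
            node = next(self._ids)
            self.unique[key] = node
            self.nodes[node] = key
        return self.unique[key]

    def build(self, formula, depth=0, env=None):
        """Shannon-expand `formula` (a callable on a dict of booleans)."""
        env = {} if env is None else env
        if depth == len(self.vars):
            return int(bool(formula(env)))
        var = self.vars[depth]
        low = self.build(formula, depth + 1, {**env, var: False})
        high = self.build(formula, depth + 1, {**env, var: True})
        return self.mk(var, low, high)

    def serialize(self, node):
        """Canonical bytes for a node, independent of internal node ids."""
        if node in (0, 1):
            return str(node).encode()
        var, low, high = self.nodes[node]
        parts = [var.encode(), self.serialize(low), self.serialize(high)]
        return b"(" + b" ".join(parts) + b")"


def implies(e):            # p -> q
    return (not e["p"]) or e["q"]


def no_counterexample(e):  # not (p and not q), equivalent to p -> q
    return not (e["p"] and not e["q"])


def conjunction(e):        # p and q, not equivalent to p -> q
    return e["p"] and e["q"]


# Two separate managers: equivalent formulas still serialize identically.
left, right = ROBDD(["p", "q"]), ROBDD(["p", "q"])
bytes_a = left.serialize(left.build(implies))
bytes_b = right.serialize(right.build(no_counterexample))
bytes_c = right.serialize(right.build(conjunction))

print(bytes_a == bytes_b)  # equivalent formulas
print(bytes_a == bytes_c)  # different formulas
```

Because the variable order is fixed and both reduction rules are applied on every node, two formulas with the same truth table end up with the same serialized bytes, which is the property the paragraph above relies on.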
And it is sealed from serving on purpose. None of this is wired into the answers a user sees. The reliability machinery exists; by invariant it is not yet allowed to serve. The primitive is sound over its declared atoms — grounding it in recognized input is the next phase, not a shipped claim. Building the guarantee before granting it the right to serve is the discipline, not a limitation.
How it learns without ever risking a served answer
Being wrong is the elimination signal — not a failure.
A system that refuses whenever it is unsure is safe but frozen — it can only ever learn what a human hands it. CORE resolves this with two regimes under one seal. What it serves is held to wrong==0. What it practices is free to be wrong — on material where being wrong is checkable and never served. There, a wrong attempt is the signal that drives elimination and learning, not a failure that reaches a user.
The seal
Practice never writes to serving. It emits proposals that carry their own proof — round-tripped, forced-unique, replay-stable, introducing zero wrong — and nothing reaches a served answer except through a reviewed gate. Practice can be as bold as its calibration allows precisely because its mistakes are structurally unable to become a served answer.
Earned, not granted
Reliability per capability class is a conservative lower bound — a one-sided Wilson floor that approaches certainty only as the record grows. It is earned by volume, never by a lucky streak. Refusals never count; refusing is always safe. An error costs more standing than a success buys, and standing is re-earned only by more clean work.
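A one-sided Wilson lower bound can be sketched as follows. This is the standard Wilson score formula, not CORE's ledger code: the function name `wilson_floor` and the 95% confidence level are assumptions, and the asymmetric cost of errors described above is not modeled beyond what the bound itself provides. Refusals are simply left out of the count.

```python
from math import sqrt
from statistics import NormalDist


def wilson_floor(successes, errors, confidence=0.95):
    """One-sided Wilson lower bound on the success rate.

    Refusals are not passed in at all, so they never move the bound.
    The 0.95 default is an illustrative assumption.
    """
    n = successes + errors
    if n == 0:
        return 0.0                          # no record, no standing
    z = NormalDist().inv_cdf(confidence)    # one-sided normal quantile
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


# A perfect record only approaches certainty as volume grows.
print(f"{wilson_floor(10, 0):.3f}")     # 0.787
print(f"{wilson_floor(1000, 0):.3f}")   # 0.997
```

With a clean record the bound reduces to n / (n + z²), so ten clean answers earn a floor near 0.79 while a thousand earn one near 0.997: volume, not a streak, is what moves it.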
Licensed by a ratio, not a reward
An action is permitted only when measured reliability clears a human-set ceiling for that action's blast radius — a deterministic ratio, not reinforcement learning. The ceilings are version-controlled constants. The engine can never raise its own ceiling: widening autonomy is always a human lowering the bar for a class the record has already earned.
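The licensing check can be pictured as a plain comparison against read-only constants. The sketch below is an assumption-laden illustration: the blast-radius names, the threshold values, and the functions `license_ratio` and `is_licensed` are all hypothetical and do not come from CORE's source.

```python
from types import MappingProxyType

# Hypothetical human-set ceilings per blast radius; values are illustrative only.
# MappingProxyType gives a read-only view, so code holding it cannot edit a ceiling.
CEILINGS = MappingProxyType({
    "read_only": 0.90,
    "reversible_write": 0.99,
    "irreversible": 0.999,
})


def license_ratio(reliability_floor, blast_radius):
    """Measured reliability floor divided by the ceiling for this blast radius."""
    return reliability_floor / CEILINGS[blast_radius]


def is_licensed(reliability_floor, blast_radius):
    """Deterministic gate: permitted only when the ratio reaches 1.0."""
    return license_ratio(reliability_floor, blast_radius) >= 1.0


print(is_licensed(0.95, "read_only"))      # True
print(is_licensed(0.95, "irreversible"))   # False
```

Nothing in the gate is learned or sampled; the same floor and the same constants always give the same answer, and changing a constant is an edit a human makes under version control.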
Audited by gold it cannot see
A live anchor set of known answers runs continuously, measuring whether the engine's self-verification is actually trustworthy per class. When that trust drifts, appetite contracts automatically. Gold does not just teach the engine — it audits whether the engine can be trusted to teach itself.
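One way to picture the audit: compare the engine's own "verified" claims against gold answers it cannot see, and shrink appetite when the two disagree. The functions `self_verification_trust` and `contracted_appetite`, the proportional shrink rule, and the 0.99 required-trust default are all assumptions made for illustration; the source does not specify how contraction is computed.

```python
def self_verification_trust(results):
    """Share of self-verified answers that actually match gold.

    `results` is an iterable of (self_verified, matches_gold) boolean pairs.
    Answers the engine did not claim to verify are ignored.
    """
    verified = [matches for claimed, matches in results if claimed]
    if not verified:
        return 0.0
    return sum(verified) / len(verified)


def contracted_appetite(appetite, trust, required_trust=0.99):
    """Shrink appetite in proportion to the trust shortfall; never grow it."""
    if trust >= required_trust:
        return appetite
    return appetite * (trust / required_trust)


anchor_run = [(True, True), (True, False), (False, False)]
trust = self_verification_trust(anchor_run)
print(trust)                                   # 0.5
print(contracted_appetite(1.0, trust, 1.0))    # 0.5
```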
As a class earns reliability, the engine is licensed to serve more of it — and the human role narrows from authoring meaning to curating what it ingests and ratifying what it has already proven. It narrows; it does not disappear.
Honest status: the substrate for this (the per-class ledger, the conservative floor, the gate, the ceilings) is built and runs in its own lane, sealed from serving by invariant. The full attempt-and-earn loop is the active work, not a shipped claim. Building the machinery that earns trust before granting it the right to serve is the discipline. (ADR-0175)
Today, and the trajectory
Honest about what works. Honest about where it is going.
Today
- 6 of 50 real GSM8K problems solved. 44 refused on the served path. wrong==0.
- mathematics_logic: expert. physics: audit-passed. systems_software: audit-passed.
- Every learned belief enters at SPECULATIVE. Promotion to COHERENT requires coherence review.
- Today, every promotion is human-in-the-loop.
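The tier rule in the list above can be sketched as a small state transition. The `Tier` enum mirrors the two tier names in the text; the `Belief` dataclass, the `promote` function, and the `reviewed_by` field are hypothetical and only illustrate "enters at SPECULATIVE, promoted only after a human review".

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Tier(Enum):
    SPECULATIVE = "speculative"
    COHERENT = "coherent"


@dataclass(frozen=True)
class Belief:
    claim: str
    tier: Tier = Tier.SPECULATIVE        # every learned belief enters here
    reviewed_by: Optional[str] = None    # hypothetical audit field


def promote(belief, reviewer):
    """Return a COHERENT copy of `belief`; refuse without a named human reviewer."""
    if not reviewer:
        raise PermissionError("promotion requires a human coherence review")
    return replace(belief, tier=Tier.COHERENT, reviewed_by=reviewer)


learned = Belief("x*(x+1) == x**2 + x")
print(learned.tier.name)                       # SPECULATIVE
print(promote(learned, "reviewer-1").tier.name)  # COHERENT
```

The dataclass is frozen, so a belief's tier cannot be changed in place; the only path to COHERENT in this sketch runs through `promote`.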
The trajectory
- Refusal is calibration, not the ceiling.
- Each new domain expands by the same reviewed, ratified process — never by gradient descent, never by bulk ingest.
- As a class earns reliability, the engine is licensed to serve more of it — graduating toward serving, not gated by human review forever.
Implications
What changes when the substrate changes.
For robotics.
Sub-millisecond decisions in 20 W. Transformers cannot fit that envelope. CORE was designed for this substrate from the first commit.
For the edge.
Real edge AI is not a quantized cloud model. It is an architecture small enough to ship in the binary, deterministic enough to certify, refusable enough to trust.
For alignment.
Alignment is not a classifier on top of a sampler. It is the algebraic property of a substrate where identity is a geometric trajectory and corruption is a structural violation.
For trust.
Every answer carries its provenance. Every refusal has a named reason. Every learned belief has a tier and a retraction path. The engine cannot lie about what it knows because the substrate does not admit the move.