Remove the verifier and most “formal reasoning engines” collapse into persuasive autocomplete. The real progress from 2023 to March 2026 did not come from models suddenly learning pure deduction. It came from changing the system boundary: retrieval narrowed the search space, proof assistants and solvers rejected invalid steps, and repair loops turned deterministic failures into usable feedback. That pattern runs from LeanDojo to AlphaProof to VERINA and WybeCoder. S1 S2 S7 S9
That distinction matters because the headline numbers are finally good enough to expose both the progress and the limit. On March 16, 2026, the latest VERINA revision still showed a large gap between “code that runs” and “code that is proved”: the best model reached 72.6% code correctness and 52.3% specification soundness/completeness, but only 4.9% proof success in a single trial. On March 31, 2026, WybeCoder pushed much further by making verification itself agentic, solving 74% of VERINA tasks at moderate compute. The lesson is blunt: if correctness matters, the winning move is not “more chain-of-thought.” It is a tighter verifier loop. S7 S9
At a Glance
- Premise retrieval is not a convenience layer. LeanDojo made it first-class by exposing accessible premises and a 98,734-theorem benchmark, and ReProver beat non-retrieval baselines plus GPT-4 with only one GPU week of training. S2
- Tool choice changes outcomes. Lam et al. showed that solver and symbolic-language choice can produce nearly 50% variation in tool-executable translations, which means the NL-to-logic bridge is core architecture, not plumbing. S3
- Compiler-guided repair is one of the highest-leverage improvements in the stack. APOLLO used Lean compiler feedback to push sub-8B theorem proving to 84.9% on miniF2F while keeping the sampling budget below one hundred, and it lifted general-purpose models from roughly 3-7% to over 40%. S4
- Benchmark hype still overruns proof reality. miniF2F-Lean Revisited found the end-to-end pipeline could be about 36% despite literature numbers of 97% autoformalization and 69% theorem proving; the corrected miniF2F-v2 improves to 70%, which is better but still not “solved.” S6
- Verified code generation is starting to move from research demo to engineering surface. VERINA exposed the bottleneck, and WybeCoder showed that code, invariants, and proofs can co-evolve in one agentic loop. S7 S9
The Architecture That Actually Works
This chapter matters because most teams still put the LLM on the wrong side of the trust boundary. In a reliable formal reasoning engine, the model is not the authority. The verifier is.
The architecture that keeps showing up in the strongest systems has six stages:
- A problem statement in natural language, code comments, or partially formal text that defines what must be proved or synthesized.
- A retrieval step that selects accessible premises, lemmas, imports, or prior states.
- A proposal step that translates the problem into candidate tactics, proof fragments, invariants, or formal specifications.
- A verifier such as Lean, Rocq, Isabelle, or an SMT solver that deterministically accepts or rejects the candidate.
- A repair controller that feeds concrete error traces back into the model and retries under a bounded budget.
- A certified output stage that returns only the artifact the verifier accepted. S1 S2 S4 S5
The trust boundary is the important design decision here. A model can help upstream of the verifier, but it should not get the final say downstream of it.
LeanDojo is still the cleanest open example of why retrieval belongs in the core loop. Its contribution was not just a public benchmark. It exposed fine-grained premise annotations, programmatic interaction with Lean, and a benchmark with 98,734 theorems and proofs where generalization requires handling premises never seen in training. That moves theorem proving away from brute-force prompt sampling and toward state-aware retrieval. S2
Lam et al. make the next point even sharper. Once you insert a symbolic solver, translation quality becomes a first-order systems problem. Their ALTA 2024 paper reports nearly 50% variation in tool-executable symbolic translations across Z3, Pyke, and Prover9 pipelines. In plain English: the same base model can look smarter or dumber depending on which formal language you force it to speak. S3
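The practical consequence is that executable-translation rate has to be measured per backend, not averaged away. A minimal sketch of that bookkeeping (the backend names and the `(backend, executed_ok)` data shape are illustrative, not from the paper):

```python
from collections import defaultdict

def executable_translation_rate(results):
    """Per-backend rate of translations the target solver could execute.

    `results` is a list of (backend, executed_ok) pairs, e.g. collected by
    running each NL-to-logic translation through its solver. Names and data
    shape here are hypothetical stand-ins for your own pipeline logs.
    """
    counts = defaultdict(lambda: [0, 0])  # backend -> [executed, total]
    for backend, executed_ok in results:
        counts[backend][1] += 1
        if executed_ok:
            counts[backend][0] += 1
    return {b: executed / total for b, (executed, total) in counts.items()}

runs = [("z3", True), ("z3", True), ("z3", False),
        ("prover9", True), ("prover9", False), ("prover9", False)]
rates = executable_translation_rate(runs)
# z3: 2/3 vs prover9: 1/3 -- a spread this large is what makes solver
# choice an architecture decision rather than plumbing
```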
That is why the modern loop looks more like this than like raw “reasoning”:
```python
goal = formalize(user_problem)
context = retrieve_accessible_premises(goal, repo_state)
candidate = llm.propose(goal, context)
for attempt in range(max_attempts):
    verdict = verifier.check(candidate)
    if verdict.ok:
        emit_certificate(candidate)
        break
    candidate = llm.repair(candidate, verdict.errors, context)
```
Tradeoff: This architecture is slower and less elegant than single-pass text generation. You pay with verifier calls, state management, and search control.
Failure mode: If you let the model free-write long proofs or specs without explicit proof state and accessible context, you get plausible garbage that fails late and costs a lot.
Mitigation: Keep the proof state explicit, restrict premise selection to reachable items, and treat verifier errors as structured control signals rather than as plain text noise. APOLLO and the Rocq case study both point in that direction. S4 S5
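Treating verifier errors as structured control signals starts with classifying them before they reach the model. The taxonomy and message patterns below are hypothetical stand-ins; real Lean diagnostics are richer, but the dispatch idea is the point:

```python
import re
from dataclasses import dataclass

@dataclass
class VerifierSignal:
    # Hypothetical taxonomy: "missing_premise" | "failed_tactic" | "type_error" | "other"
    category: str
    detail: str

def classify_error(message: str) -> VerifierSignal:
    """Map a raw verifier error line to a structured control signal,
    so the controller can choose wider retrieval vs. tactic repair."""
    if re.search(r"unknown (identifier|constant)", message):
        return VerifierSignal("missing_premise", message)
    if "tactic" in message and "failed" in message:
        return VerifierSignal("failed_tactic", message)
    if "type mismatch" in message:
        return VerifierSignal("type_error", message)
    return VerifierSignal("other", message)

sig = classify_error("unknown identifier 'Nat.add_comm'")
# A "missing_premise" signal should trigger wider retrieval,
# not another blind proof attempt
```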
Benchmark Reality: Where the Field Actually Is
The benchmark question is not whether a model can emit proof-shaped text. It is whether the full system can land a verified artifact under a strict evaluator.
The field is improving, but proof success is still the scarce resource. Any article that only shows “reasoning” demos and hides verifier pass rates is selling theater.
The short version looks like this:
| System | Date | What the result actually says | Why it matters |
|---|---|---|---|
| AlphaProof | July 25, 2024 announcement; methodology published November 12, 2025 | Solved 4 of 6 IMO 2024 problems for 28/42 points, silver-medal level, using formalization plus search in Lean | Formal verification plus RL search can scale to elite math, but only with heavy formalization and search machinery. S1 |
| miniF2F-Lean Revisited | November 5, 2025 | End-to-end pipeline accuracy was about 36% versus much larger component-level claims; corrected miniF2F-v2 reaches 70% | Benchmark quality can distort perceived progress. Component wins do not automatically compose. S6 |
| APOLLO | May 9, 2025 initial submission; updated November 4, 2025 | 84.9% miniF2F accuracy among sub-8B models with low sampling budget; general-purpose models jump from low single digits to above 40% through repair | Compiler-guided repair changes the economics of theorem proving. S4 |
| Goedel-Prover-V2 | August 5, 2025 | 88.1% miniF2F pass@32 in standard mode and 90.4% in self-correction mode; 86 PutnamBench problems at pass@184 | Specialized provers now outperform much larger open models when the feedback loop is well tuned. S8 |
| VERINA | May 29, 2025 initial release; v3 March 16, 2026 | 72.6% code correctness, 52.3% spec soundness/completeness, 4.9% proof success for the best general-purpose model | Verified code generation is still mostly a proof bottleneck, not a code-synthesis bottleneck. S7 |
| WybeCoder | March 31, 2026 | 74% of VERINA tasks and 62% of Clever tasks at moderate compute | Agentic verification loops can now push beyond the plateau that VERINA exposed. S9 |
Three conclusions fall out of those results.
First, proof success does not track code success. VERINA is the clearest example: code, specification, and proof are three different competence layers, and the third one still lags badly. S7
Second, repair depth and compute budget matter more than most benchmark tables admit. A model with self-correction, compiler feedback, and a bounded search policy is not directly comparable to a single-pass baseline, even if both are “theorem provers.” S4 S8
Third, benchmark curation quality is part of the science. miniF2F-Lean Revisited showed how easy it is to over-read progress when formal and informal problem statements are misaligned. S6
Tradeoff: Once you compare full verifier loops instead of raw model outputs, evaluation becomes more honest but also more expensive and harder to normalize.
Failure mode: Teams report best-of-many proof results without clearly stating retry count, pass@k budget, or benchmark corrections. That makes the system look more reliable than it is.
Mitigation: Always publish pass@1 and pass@k, state the repair budget, and prefer end-to-end benchmarks where code, specification, and proof all need to line up. S6 S7
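The standard unbiased pass@k estimator makes the retry budget explicit. Here it is applied to verifier acceptances rather than unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c verifier-accepted ones.

    This is the standard combinatorial estimator, 1 - C(n-c, k) / C(n, k):
    the probability that a random draw of k samples contains at least one
    accepted proof.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 20 sampled proofs, 3 accepted by the verifier
p1 = pass_at_k(20, 3, 1)  # 0.15
p8 = pass_at_k(20, 3, 8)  # much higher: retries hide per-sample weakness
```

Publishing both numbers, plus the token cost behind them, is what makes two verifier loops comparable.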
Three Implementation Patterns Worth Using
This chapter matters because the current research is finally concrete enough to steal from. You do not need to reproduce AlphaProof or train a theorem prover from scratch to benefit from the architecture.
1. Retrieval-First Proving
If your environment already lives in Lean, Rocq, Isabelle, or a typed spec language, retrieval is the first thing to operationalize. LeanDojo’s main insight was that premise selection is not a side quest. It is the search-space controller. By exposing reachable premises and hard negatives, it made retrieval better and cheaper at the same time. S2
Minimal pattern:
```python
state = prover.current_state()
premises = retriever.top_k_reachable(state, k=32)
tactic = llm.propose_tactic(state, premises)
result = prover.apply(tactic)
```
Tradeoff: Aggressive premise filtering reduces search cost, but it also increases the chance of silently dropping the one lemma that matters.
Failure mode: The retriever learns the training distribution of theorem libraries and misses novel but accessible premises in new projects.
Mitigation: Log premise recall against solved proofs, keep a fallback “wider retrieval” mode, and evaluate on splits that require novel premises, not just memorized libraries. LeanDojo’s benchmark design is still a good template here. S2
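Premise recall is cheap to log once proofs close. A minimal sketch, assuming you can extract the set of premises the finished proof actually used (names below are illustrative):

```python
def premise_recall_at_k(retrieved: list[str], used_in_proof: set[str], k: int) -> float:
    """Fraction of premises the finished proof used that appeared in the
    retriever's top-k. Low values mean the retriever, not the prover,
    is the bottleneck."""
    if not used_in_proof:
        return 1.0  # vacuous goal: nothing to recall
    top_k = set(retrieved[:k])
    return len(top_k & used_in_proof) / len(used_in_proof)

recall = premise_recall_at_k(
    retrieved=["Nat.add_comm", "Nat.mul_comm", "List.length_append"],
    used_in_proof={"Nat.add_comm", "List.length_append"},
    k=2,
)
# Only 1 of the 2 used premises made the top-2 cut -> 0.5:
# a signal to widen retrieval before blaming the prover
```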
2. Compiler-Guided Repair Instead of Blind Resampling
APOLLO is important because it treats theorem proving like debugging, not divination. It lets the model fail, collects Lean feedback, isolates subgoals, applies automated solvers where possible, and only then asks the model to repair the residue. That is a much better use of tokens than sampling thousands of whole proofs. S4
The 2025 Rocq case study points in the same direction from a different angle. On the hs-to-coq and Verdi projects, Bayazıt, Li, and Si found that external dependencies and same-file context significantly help proof generation, and that performance varies materially across verification projects. Translation: proof automation is deeply context-sensitive. S5
Minimal pattern:
```python
candidate = llm.prove(goal, file_context, deps)
for attempt in range(8):
    verdict = lean.check(candidate)
    if verdict.ok:
        return candidate
    subgoals = isolate_failing_subgoals(verdict)
    candidate = llm.repair(candidate, errors=verdict.errors, subgoals=subgoals)
raise ProofFailure
```
Tradeoff: Repair loops add latency and orchestration complexity, especially if you split failing proofs into sub-problems and invoke auxiliary solvers.
Failure mode: The system burns compute “repairing” proofs that were based on the wrong formal statement from the start.
Mitigation: Separate statement validation from proof search. If the formal goal looks unstable, stop and review the translation instead of iterating forever on a broken target. S3 S5
3. Prove-as-You-Generate Verified Code
The interesting 2026 shift is that formal reasoning is no longer confined to math benchmarks. VERINA created a clean arena for jointly generating code, formal specifications, and proofs on 189 manually curated Lean tasks. Then WybeCoder showed how to turn that into an engineering loop where code, invariants, and proofs co-evolve. S7 S9
Minimal pattern:
```
natural-language task
  -> candidate implementation
  -> candidate specification
  -> verification conditions
  -> SMT discharge where possible
  -> Lean fallback for unresolved obligations
  -> invariant synthesis and proof repair
  -> verified artifact
```
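The SMT-first, Lean-fallback routing in the middle of that pipeline is just control flow. A sketch with injected stand-in solvers (`smt_check` and `lean_prove` are hypothetical callables, not real solver bindings):

```python
def discharge_obligations(obligations, smt_check, lean_prove):
    """Route each verification condition to the cheap automated solver
    first, and fall back to interactive proof only for the residue."""
    proved, residue = [], []
    for vc in obligations:
        if smt_check(vc):        # cheap, fully automatic
            proved.append(vc)
        elif lean_prove(vc):     # expensive, LLM-assisted
            proved.append(vc)
        else:
            residue.append(vc)   # candidates for invariant repair
    return proved, residue

proved, residue = discharge_obligations(
    obligations=[1, 2, 3, 4, 5],
    smt_check=lambda vc: vc % 2 == 0,  # stand-in: evens are "SMT-easy"
    lean_prove=lambda vc: vc == 3,     # stand-in: one hard VC closes in Lean
)
# proved == [2, 3, 4]; residue == [1, 5]
```

The design point is that the expensive path only ever sees what the cheap path rejected, which is what keeps proof-token cost bounded.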
Tradeoff: This approach is operationally heavier than ordinary code generation. You need spec generation, VC generation, solver routing, and proof orchestration.
Failure mode: The code generator and spec generator drift into a locally consistent but semantically wrong pair. Everything type-checks, but the system proved the wrong thing.
Mitigation: Make specification review explicit, especially on safety-critical paths. Formal verification removes one class of error very well, but it cannot recover intent that was mis-formalized upstream. VERINA and the tool-based reasoning literature both show that translation is where a lot of the real risk lives. S3 S7
Operational Risks and Mitigations
This chapter matters because formal reasoning engines fail differently from ordinary coding copilots. They do not just hallucinate APIs. They can formalize the wrong claim, retrieve the wrong theorem, or burn huge amounts of compute exploring dead proof branches.
| Risk | Why it happens | Mitigation |
|---|---|---|
| Semantic drift in formalization | Natural language intent is underspecified, and the translator must commit to one formal reading | Add human review on formal statements before long proof runs; treat spec review as a gate, not a nice-to-have. S3 S7 |
| Executable-translation failure | Each solver has its own symbolic language and brittleness profile | Measure tool-executable rate separately from final accuracy; do not swap solvers without re-benchmarking the translator. S3 |
| Context starvation | The model sees too little of the local theorem library, imports, or file-level definitions | Inject reachable dependencies and same-file context; evaluate on real projects, not just toy benchmarks. S2 S5 |
| Search-space explosion | Proof search, invariant synthesis, and self-correction all multiply compute | Bound retries, track proof-token cost per success, and route obvious subgoals to cheaper automated solvers first. S4 S9 |
| Benchmark overestimation | Component metrics are reported without full end-to-end composition or benchmark cleanup | Use corrected or end-to-end benchmarks such as miniF2F-v2 and VERINA-style evaluations. S6 S7 |
The operational trap is pretending that formal methods eliminate uncertainty everywhere. They do not. They eliminate uncertainty after the formal statement is fixed. Everything before that step is still vulnerable to translation errors, missing context, and bad task decomposition.
Observability and SLO Model
The papers do not give you a production SLO table. This one is my operational translation of their failure modes.
You do not have a reliable reasoning engine unless you can measure at least these:
- executable_translation_rate
- reachable_premise_recall_at_k
- verifier_pass_rate_pass1
- verifier_pass_rate_pass8
- repair_yield_per_round
- proof_token_cost_per_success
- spec_review_reject_rate
- invariant_acceptance_ratio
- median_verifier_latency_ms
- human_override_rate
The most important metric is not “reasoning score.” It is verified artifacts per unit of cost. A model that writes elegant proof-shaped text and rarely closes the loop is not production-ready.
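That metric is one line of arithmetic, which is exactly why there is no excuse for not tracking it. The numbers below are invented for illustration:

```python
def verified_artifacts_per_megatoken(successes: int, total_tokens: int) -> float:
    """Verifier-accepted artifacts per million tokens spent, counted
    across all attempts and repair rounds, not just the winning one."""
    return successes / (total_tokens / 1_000_000)

# System A: flashy pass rate bought with a huge retry budget
a = verified_artifacts_per_megatoken(successes=40, total_tokens=80_000_000)
# System B: lower pass rate on a tight budget
b = verified_artifacts_per_megatoken(successes=25, total_tokens=10_000_000)
# B is the better production engine despite the worse leaderboard number
```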
I would use gates like these before rollout:
| Dimension | Gate |
|---|---|
| Translation quality | executable_translation_rate >= 95% on the target task family |
| Verifier reliability | pass@1 is tracked, and pass@8 improves materially without runaway token cost |
| Repair efficiency | Additional rounds stop when marginal proof yield flattens |
| Intent integrity | Formal statement review escapes stay below an agreed audit threshold |
| Cost control | Proof-token cost per success does not grow faster than throughput value |
Failure mode: Teams watch only final success rate and ignore whether the improvement came from real reasoning, wider retrieval, or simply throwing more compute at retries.
Mitigation: Track pass@1, pass@k, and cost together. If you only show the last number, you are hiding the system. S4 S7 S9
Rollout Plan With Checkpoints
This chapter matters because formal reasoning engines are not an all-at-once replacement for human review. They are a control plane you layer in stages.
Phase 1: Benchmark Your Real Task
Start with offline evaluation on a narrow slice: protocol invariants, parser correctness, arithmetic kernels, or theorem-heavy libraries. Do not start with free-form product code.
Checkpoint: If you cannot write crisp formal statements for the slice, you are not ready for proof automation there.
Phase 2: Add Retrieval Before You Add Autonomy
Bring in theorem-library or project-level retrieval first. Measure premise recall, translation executable rate, and verifier pass@1.
Checkpoint: If retrieval does not materially improve verifier pass@1, your premise model or context packing is weak.
Phase 3: Add Repair Loops Under Budget
Only after pass@1 is stable should you add APOLLO-style repair. Cap retries. Route cheap subgoals to deterministic solvers first.
Checkpoint: If pass@k rises but proof-token cost explodes, you have a search problem, not a reasoning breakthrough.
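A simple stopping rule catches the flattening before the budget burns. The per-round yield numbers below are invented for illustration; the data shape is an assumption about your own telemetry:

```python
def stop_after_flattening(yield_per_round: list[float], min_gain: float = 0.01) -> int:
    """Return how many repair rounds are worth running: stop at the first
    round whose marginal proof yield drops below `min_gain`.

    `yield_per_round[i]` is the fraction of remaining failures that round i
    newly converted into verified proofs.
    """
    for i, gain in enumerate(yield_per_round):
        if gain < min_gain:
            return i
    return len(yield_per_round)

budget = stop_after_flattening([0.30, 0.12, 0.04, 0.008, 0.006])
# Rounds 4 and 5 add almost nothing: cap the loop at 3 rounds
```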
Phase 4: Move to Verified Code Generation
Once the theorem-proving slice is stable, extend into code/spec/proof co-generation for a narrow class of algorithms or components. This is where VERINA and WybeCoder become useful templates. S7 S9
Checkpoint: If code correctness rises faster than proof success, keep the verifier loop narrow. You are still in assisted coding, not verified coding.
Phase 5: Keep the Human at the Intent Boundary
Human reviewers should exit the inner proof loop over time, but they should stay at the formalization boundary for much longer.
Checkpoint: If the team starts saying “the proof passed, so the requirement must be right,” stop and tighten the statement-review process.
Bottom Line
Formal reasoning engines are already useful. They are just useful in a narrower, more disciplined way than the marketing suggests.
The winning pattern is clear:
- the model proposes;
- retrieval narrows the search;
- the verifier decides;
- repair loops recover from concrete failures;
- humans stay responsible for intent.
That is why I think “beyond pattern matching” is the right frame. The field did not escape statistical generation. It learned how to put statistical generation inside a deterministic shell.
If your system cannot emit a proof object, a solver-backed verdict, or a certificate that survives re-checking, call it assistance. Do not call it reasoning.
