Automated-Verification

Remove the verifier and most “formal reasoning engines” collapse into persuasive autocomplete. The real progress from 2023 to March 2026 did not come from models suddenly learning pure deduction. It came from changing the system boundary: retrieval narrowed the search space, proof assistants and solvers rejected invalid steps, and repair loops turned deterministic failures into usable feedback. That pattern runs from LeanDojo to AlphaProof to VERINA and WybeCoder. S1 S2 S7 S9

That distinction matters because the headline numbers are finally good enough to expose both the progress and the limit. On March 16, 2026, the latest VERINA revision still showed a large gap between “code that runs” and “code that is proved”: the best model reached 72.6% code correctness and 52.3% specification soundness/completeness, but only 4.9% proof success in one trial. On March 31, 2026, WybeCoder pushed much further by making verification itself agentic, solving 74% of Verina tasks at moderate compute. The lesson is blunt: if correctness matters, the winning move is not “more chain-of-thought.” It is a tighter verifier loop. S7 S9 ...
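The propose, verify, repair cycle described above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: `verify` is an exhaustive check over a small finite domain standing in for a proof assistant or solver, `POOL` and `propose` stand in for a model that generates candidates, and none of the names come from LeanDojo, AlphaProof, VERINA, or WybeCoder. The only point is the shape of the loop: a deterministic rejection produces a counterexample, and that counterexample constrains the next proposal.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Candidate = Callable[[int], int]


@dataclass
class Verdict:
    ok: bool
    # (input, expected output) for the first failing case, if any
    counterexample: Optional[Tuple[int, int]] = None


def verify(candidate: Candidate) -> Verdict:
    """Toy verifier (assumption): checks the spec f(x) == abs(x) on -5..5.

    A real system would call a proof assistant or solver here; this
    stand-in is only deterministic and exhaustive on a tiny domain.
    """
    for x in range(-5, 6):
        if candidate(x) != abs(x):
            return Verdict(False, (x, abs(x)))
    return Verdict(True)


# Hypothetical candidate pool standing in for model-generated programs.
POOL: List[Tuple[str, Candidate]] = [
    ("identity", lambda x: x),
    ("negate", lambda x: -x),
    ("branch", lambda x: x if x >= 0 else -x),
]


def propose(known: List[Tuple[int, int]]) -> Optional[Tuple[str, Candidate]]:
    """Return the first candidate consistent with every known counterexample."""
    for name, fn in POOL:
        if all(fn(x) == want for x, want in known):
            return name, fn
    return None


def repair_loop(max_rounds: int = 5) -> Tuple[Optional[str], List[str]]:
    """Propose, verify, and feed each failure back into the next proposal."""
    known: List[Tuple[int, int]] = []
    log: List[str] = []
    for _ in range(max_rounds):
        proposal = propose(known)
        if proposal is None:
            break  # the pool has no candidate left that fits the feedback
        name, fn = proposal
        verdict = verify(fn)
        if verdict.ok:
            log.append(f"{name}: accepted")
            return name, log
        known.append(verdict.counterexample)
        log.append(f"{name}: rejected at x={verdict.counterexample[0]}")
    return None, log


if __name__ == "__main__":
    winner, history = repair_loop()
    print(winner)
    for line in history:
        print(line)
```

In this sketch the verifier, not the proposer, decides what counts as done: capping the loop at two rounds returns no result at all rather than an unverified candidate.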