On April 16, 2026, I replaced my earlier local Gemma run with a heavier stack:
```bash
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL --jinja
```
The result was not “a slightly better chat model.” The result was a qualitatively different local agent loop.
This time the agent did not stop at repo reconnaissance, rough planning, or code scaffolding. It wrote a research narrative, generated a slide deck module by module, rebuilt after failures, converted the deck to PDF, rasterized slides for visual inspection, read the resulting PNGs, ran text extraction against the .pptx, checked for placeholder residue, and then closed the loop with targeted repairs. That is not AGI. But it is no longer a toy local demo either.
My April 4 write-up on pi + Gemma 4 argued that the floor for local agentic coding had moved. After this Qwen3.6 run, my view is stronger: the floor has moved again. The important shift is not that the model became perfect. The important shift is that a local open-weight stack can now complete a multi-artifact execution loop with its own QA and repair phases, as long as you constrain it with deterministic checks and refuse to confuse fluency with correctness.
Context and Status on April 16, 2026
The official Qwen3.6 model card positions Qwen3.6-35B-A3B as the first open-weight Qwen3.6 variant, explicitly centered on agentic coding, repository-level reasoning, and “thinking preservation.” It lists 35B total parameters with 3B activated, 256 experts, 8 routed plus 1 shared active experts, and a native 262,144-token context window extensible to about 1.01M with YaRN-style scaling. S1
Architecturally, this is not a standard dense transformer. The model card specifies a hybrid layout: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)). That means 75% of the model’s layers use Gated DeltaNet, a linear attention variant, while only 25% use full quadratic attention. S1 This matters for local inference: linear attention layers are faster and use less memory per token than standard attention, which is part of why a 35B-total model with only 3B activated can run on consumer hardware without collapsing under context pressure.
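The layer arithmetic behind those percentages is easy to check (a quick sanity computation based on the layout quoted above, not code from any official repository):

```python
# Hybrid layout from the model card: 10 repeats of
# (3 × (Gated DeltaNet → MoE) followed by 1 × (Gated Attention → MoE)).
REPEATS = 10
DELTANET_PER_REPEAT = 3
ATTENTION_PER_REPEAT = 1

deltanet_blocks = REPEATS * DELTANET_PER_REPEAT    # 30 linear-attention blocks
attention_blocks = REPEATS * ATTENTION_PER_REPEAT  # 10 quadratic-attention blocks
total = deltanet_blocks + attention_blocks         # 40 blocks overall

assert deltanet_blocks / total == 0.75   # 75% Gated DeltaNet
assert attention_blocks / total == 0.25  # 25% full quadratic attention
```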
That positioning matters because this is not just another instruct checkpoint with a marketing paragraph about “agents.” Qwen is publishing benchmark numbers against coding-agent workloads directly in the model card. On the coding-agent slice, it reports 73.4 on SWE-bench Verified, 51.5 on Terminal-Bench 2.0, 68.7 on Claw-Eval Avg, and 52.6 on QwenClawBench. On tool-use benchmarks specifically, it scores 37.0 on MCPMark and 62.8 on MCP-Atlas. S1 One important caveat: the SWE-bench numbers use Qwen’s own internal agent scaffold with bash and file-edit tools at temp=1.0, top_p=0.95, and a 200K context window. S1 These are self-reported scores, not independently reproduced. They still explain why this model family is materially more relevant to terminal-native agent work than a generic chat release, but treat them as directional evidence, not ground truth for your own stack.
Unsloth’s GGUF release is also not incidental. Their Qwen3.6 page explicitly calls out developer-role support for Codex/OpenCode-style terminals and tool-calling improvements, while publishing a full spread of Dynamic 2.0 quants up through 8-bit. On the 8-bit tier, the page lists Q8_0 at 36.9 GB and UD-Q8_K_XL at 38.5 GB, which is heavy but still realistic on high-memory Apple Silicon or a workstation with enough RAM. S2
That combination is what made the run interesting on April 16, 2026:
- a model family explicitly optimized for agentic coding S1
- a GGUF distribution path designed for local runtimes and tool-calling robustness S2 S7
- llama.cpp exposing an OpenAI-compatible local server with direct Hugging Face fetches via -hf S3
- pi acting as the thin terminal harness that feeds the model files, tools, skills, images, and shell feedback S5
The future did not arrive because one benchmark number moved. It arrived because the composition finally crossed a practical threshold.
The Exact Stack I Ran
The whole stack ran on a MacBook Pro with an M4 Max and 64 GB unified memory. UD-Q8_K_XL at ~38.5 GB fits comfortably with headroom for the OS, llama-server KV cache, and the post-processing toolchain (soffice, magick, markitdown). That headroom matters: you want the agent loop to finish without swapping, not just to load the weights.
The architecture was simple enough to fit in one terminal, which is part of the point:
```mermaid
flowchart LR
    U["User request"] --> P["pi terminal harness"]
    P --> C["AGENTS.md + skills + session tree"]
    P --> T["read / write / edit / bash"]
    P --> S["llama-server"]
    S --> M["Unsloth Qwen3.6 35B A3B GGUF"]
    T --> B["deck build"]
    B --> V["visual QA"]
    V --> R["repair loop"]
    R --> B
```
pi is well suited to exactly this kind of experiment because it stays deliberately thin. Its README describes it as a “minimal terminal coding harness,” with four default tools (read, write, edit, bash), session branching, compaction, AGENTS loading, and on-demand skills. S5 What it explicitly does not include is equally important: no MCP, no sub-agents, no plan mode, no permission popups. S5 Those are all features you can add through TypeScript extensions or pi packages, but the core refuses to bake them in. That design tradeoff matters operationally: less hidden orchestration means less mystery when you are debugging a local agent loop.
llama.cpp is the second critical layer. The project documents llama-server -hf <repo> as a direct path to launching an OpenAI-compatible local API, and explicitly frames GGUF + quantized inference as part of its normal operating model. S3 For this run, that meant I could point a terminal agent at a local endpoint without rewriting my harness around a custom runtime protocol.
The third layer is the model-specific template path. Qwen-family chat formatting is not a side issue. The llama.cpp template wiki explains that it reads tokenizer.chat_template metadata and includes a built-in Jinja parser, minja, to apply those templates. S4 That is exactly the category of detail that separates “the model loads” from “the agent behaves correctly over multiple turns.”
Finally, the post-processing stack mattered:
```bash
node build.cjs
soffice --headless --convert-to pdf deck.pptx --outdir qa
magick -density 200 qa/deck.pdf -resize 1920x1080 qa/slide-%02d.png
python3 -m markitdown deck.pptx | grep -Ei "xxx|lorem|placeholder|TODO"
```
That last step is more important than it looks. Microsoft positions MarkItDown as a document-to-Markdown utility for LLM and text-analysis pipelines, and explicitly lists PowerPoint among the supported formats while warning that it is meant for structural text extraction rather than high-fidelity presentation conversion. S6 In other words, it is exactly right for content QA and exactly wrong for visual publishing. That makes it a perfect gate in a self-healing agent loop.
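The same grep gate can live in Python, which makes it easier to reuse inside a harness extension. A minimal sketch (the helper names are mine; the pattern list mirrors the grep in the pipeline above):

```python
import re

# Patterns that indicate unfinished content, mirroring the grep above.
PLACEHOLDER_RE = re.compile(r"xxx|lorem|placeholder|TODO", re.IGNORECASE)

def placeholder_residue(markdown: str) -> list[str]:
    """Return every extracted line that still looks like placeholder text."""
    return [line for line in markdown.splitlines() if PLACEHOLDER_RE.search(line)]

def content_gate_passes(markdown: str) -> bool:
    """Deterministic go/no-go: zero placeholder lines means the gate passes."""
    return not placeholder_residue(markdown)
```

In the real loop, `markdown` would be the output of `python3 -m markitdown deck.pptx`.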
Why Qwen3.6 Changed the Result
The short answer is that Qwen3.6 looks like a model trained with this use case in mind.
The longer answer is that several details stack on top of each other:
| Layer | Relevant fact | Why it changed the run |
|---|---|---|
| Qwen3.6 core model | 35B total, 3B activated, hybrid DeltaNet/attention architecture, 262K native context S1 | Linear attention layers reduce memory pressure; MoE activation keeps compute manageable on local hardware |
| Qwen3.6 API semantics | Preserved thinking traces in agent scenarios reduce “redundant reasoning” and improve “decision consistency” S1 | Multi-step execution stays coherent because the model reuses prior reasoning instead of re-deriving it each turn |
| Unsloth GGUF release | Developer-role support and tool-calling improvements are called out directly S2 | The local terminal experience becomes less brittle when role and tool conventions line up |
| Dynamic 2.0 quantization | Unsloth selectively quantizes every layer using per-layer KL divergence minimization, with a >1.5M-token calibration dataset tuned for chat S7 | The quant is not just smaller; it is actively minimizing behavioral drift from the full-precision model |
| llama.cpp server path | OpenAI-compatible serving plus direct HF fetch keeps the stack boring S3 | The harness can focus on task execution instead of runtime glue |
What I observed on top of that source-backed stack was this:
- the agent could keep a narrative plan in working memory long enough to finish a 16-slide deck
- it could accept runtime feedback without collapsing into generic apology mode
- it could move from generation to verification to repair without external handholding at every micro-step
- it could treat rendered images as part of the task surface rather than as an afterthought
That last point is subtle. The Qwen3.6 model card is for an image-text-to-text model with a vision encoder. S1 llama.cpp also now documents multimodal support in llama-server. S3 I am not claiming a full benchmarked local multimodal stack from one session. I am saying the workflow was operationally multimodal enough to inspect rendered slides and keep moving. That is already strategically important.
The Loop That Crossed the Threshold
The most important part of the run was not the initial generation. It was the closed loop.
From the session log, the agent executed a sequence that looked like this:
- Plan the slide deck structure and create slide modules.
- Write the modules to disk, one file at a time.
- Rebuild the deck.
- Notice concrete defects in generated code and patch them.
- Convert the .pptx to PDF.
- Rasterize the PDF to slide images.
- Read slide PNGs back into the agent for visual QA.
- Extract markdown from the .pptx and grep for placeholder residue.
- Confirm all slides were extracted and then summarize the QA outcome.
That is the real story. Not “the model felt smarter.” The model stayed inside a deterministic artifact loop long enough to make itself useful.
Three specific moments from the session are worth calling out.
1. It repaired structural code defects instead of restarting from scratch
The run did not succeed on the first build. It patched a missing LAYOUT import in 02_exec_summary.cjs, then repaired table header definitions in 05_five_levels.cjs, 10_open_source_tools.cjs, and 14_sdlc_coverage.cjs, rebuilt the deck, and continued. That is a materially better failure mode than the one I saw in the Gemma 4 run, where exactness degraded earlier and more visibly on layout-sensitive work.
The tradeoff is that the repair was still local and tactical. It fixed what the current build surfaced. It did not prove global correctness. The mitigation is to keep the loop instrumented with explicit gates instead of trusting the agent’s own confidence language.
2. It treated visual QA as part of execution, not as a human-only afterthought
After rebuilding the deck, the agent converted the .pptx to PDF, turned the PDF into slide PNGs, listed the generated images, and then read them back one by one. That is the right pattern for agentic artifact work: if the output is visual, the verification surface must also be visual.
The tradeoff is latency. Visual QA turns one artifact into multiple secondary artifacts, and every one of those files consumes wall-clock time, tokens, or both. The mitigation is to reserve visual QA for high-value checkpoints: title slide, dense table slide, visual-summary slide, and final pass.
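That checkpoint policy is small enough to encode directly. A sketch (hypothetical helper, assuming 1-indexed slide numbers):

```python
def visual_qa_checkpoints(slide_count: int, dense_table_slides: set[int]) -> list[int]:
    """Pick the slides worth rasterizing and re-reading: the title slide,
    layout-sensitive table slides, and the final slide. Everything else
    is covered by the cheaper structural-text pass."""
    checkpoints = {1, slide_count}
    checkpoints |= {s for s in dense_table_slides if 1 <= s <= slide_count}
    return sorted(checkpoints)
```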
3. It used text extraction as a second QA channel
The python3 -m markitdown ... | grep ... step is exactly the kind of small deterministic guardrail that local agents need. MarkItDown is built for structure-preserving text extraction, not pixel-perfect rendering. S6 That makes it a strong way to catch placeholder text, missing sections, or broken speaker-note content even when the visual layer looks superficially clean.
The tradeoff is that content QA can still pass while the slide remains ugly. The mitigation is to pair structural text QA with image-based inspection instead of pretending one can replace the other.
This is why I call the session a threshold event. The agent did not merely generate. It generated, verified, repaired, and verified again.
What Was Different From My Gemma 4 Run
My April 4 Gemma 4 write-up argued that the first things to break under a local quantized agent were exactness, path resolution, spatial arithmetic, and long-context responsiveness. That still broadly holds. What changed with Qwen3.6 was where the system could still recover.
Here is the most honest comparison I can make from the two sessions:
| Observed dimension | April 4: Gemma 4 26B A4B | April 16: Qwen3.6 35B A3B |
|---|---|---|
| Architecture | Dense transformer, standard attention | Hybrid DeltaNet (75%) + attention (25%), MoE |
| Best behavior | Reconnaissance, planning, skill use | Multi-step artifact loop with QA and repair |
| Failure character | Exactness broke early | Exactness still failed, but recovery was stronger |
| Operational ceiling | Useful first 70 percent | Usable near-complete worker with guardrails |
| Main risk | Silent precision decay | Latency, context pressure, and false confidence |
That does not mean Qwen3.6 “beats” every model on every task. It means this particular model + quant + runtime + harness combination crossed into a more serious class of local work.
Implementation Patterns I Would Keep
If I were productizing this workflow instead of celebrating a one-off run, I would keep these three patterns.
Pattern 1: Artifact loops must have at least two independent QA surfaces
For a slide deck, that means one visual surface and one structural-text surface.
- visual surface: rendered slide images
- structural-text surface: markdown extraction plus placeholder scanning
If both pass, the artifact is much more trustworthy than if the model merely says “looks good.”
Pattern 2: Let the model own repairs, but never let it define completion
The agent should absolutely be allowed to patch missing imports, malformed headers, and broken table definitions. That is where it creates leverage.
But the completion contract must stay external:
- build exits cleanly
- no placeholder residue
- expected slide count present
- key slides pass visual inspection
Without those gates, “self-healing” degenerates into “self-narrating.”
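A minimal sketch of that external contract (all names are hypothetical; the field values would come from the build exit code, the markitdown scan, and the image pass):

```python
from dataclasses import dataclass

@dataclass
class LoopState:
    build_exit_code: int        # from the deck build
    placeholder_lines: int      # from the markitdown + grep pass
    slides_found: int           # slides actually rasterized/extracted
    slides_expected: int        # slides the plan promised
    checkpoint_visual_pass: bool  # key slides passed image inspection

def run_is_complete(s: LoopState) -> bool:
    """Completion is defined outside the model: every gate must hold."""
    return (
        s.build_exit_code == 0
        and s.placeholder_lines == 0
        and s.slides_found == s.slides_expected
        and s.checkpoint_visual_pass
    )
```

The agent may repair its way toward this state, but it never gets to redefine it.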
Pattern 3: Context is a budget, not a trophy
By the end of the run, pi’s UI footer showed context usage at 64.2%/128k. That is already enough to change the feel of a local session. It is not the theoretical max context from the model card. It is the working budget the operator feels.
Qwen3.6 officially supports 262,144 tokens natively and can stretch further with YaRN-style scaling. S1 But the same model card also recommends using long-context scaling only when you actually need it, and separately advises maintaining “a context length of at least 128K tokens to preserve thinking capabilities.” S1 That is Qwen explicitly telling you that thinking mode has a minimum context budget. Cut the window too aggressively and you lose the reasoning behavior that makes the model useful for agent work.
My operational rule is simple: treat any local session above roughly 60% working-window occupancy as a warning zone unless you have strong compaction or a staged task plan.
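That rule is trivial to encode (a sketch with the 60% threshold from the rule above as the default):

```python
def context_zone(tokens_used: int, working_window: int, warn_frac: float = 0.60) -> str:
    """Classify session context pressure against the active working window,
    not the model card's theoretical maximum."""
    occupancy = tokens_used / working_window
    return "warning" if occupancy >= warn_frac else "ok"
```

The 64.2% of 128k that pi's footer reported at the end of this run would already land in the warning zone.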
Risks and Mitigations
This is where the article stops being impressed and starts being useful.
Risk 1: Heavy quants make the model better, but not cheap
UD-Q8_K_XL is a high-fidelity choice, and that is exactly why it is large. Unsloth’s page places it in the high-30 GB range. S2 That is acceptable on some machines and ridiculous on others.
Mitigation: choose the quant based on the task’s failure mode, not on abstract benchmark envy. If the workload is artifact-heavy and repair-sensitive, pay for precision. If it is simple recon or rough drafting, drop lower.
Risk 2: Template correctness and sampling parameters are real dependencies
Qwen-family behavior depends on chat-template handling. llama.cpp is explicit that template application is part of inference plumbing, and that it ships a Jinja parser (minja) for this path. S4 The `--jinja` flag in my launch command is not decorative; it activates the Jinja parser that correctly processes Qwen3.6's `<think>` blocks and tool-call formatting.
Sampling parameters also matter more than they look. Qwen’s model card recommends distinct parameter sets for different modes: temperature=0.6, presence_penalty=0.0 for precise coding tasks versus temperature=1.0, presence_penalty=1.5 for general tasks. S1 Using the wrong parameters for your workload can produce noticeably different failure rates.
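A small helper keeps those presets from drifting between sessions (a sketch; the parameter values mirror the model card's recommendations quoted above, while the function name and the "default to the precise preset" choice are mine):

```python
# Parameter sets per the model card's recommendations (S1); treat the
# exact values as the card's guidance, not universal constants.
SAMPLING_PRESETS = {
    "coding": {"temperature": 0.6, "presence_penalty": 0.0},
    "general": {"temperature": 1.0, "presence_penalty": 1.5},
}

def sampling_params(task_kind: str) -> dict:
    """Pick the preset matching the workload; fall back to the precise one."""
    return SAMPLING_PRESETS.get(task_kind, SAMPLING_PRESETS["coding"])
```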
Mitigation: do not treat template flags, role formatting, or sampling parameters as cosmetic. Test a short tool-use conversation with your exact parameter set before you trust a long coding session.
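One way to run that smoke test is to hand-build a one-turn tool-use request for the OpenAI-compatible endpoint llama-server exposes. A sketch (the tool schema, function name, and port are illustrative stand-ins, not from any official documentation):

```python
import json

def build_smoke_test_request(model: str = "local") -> dict:
    """One-turn tool-use request to verify template and sampling behavior
    before trusting a long coding session. The weather tool is a stand-in."""
    return {
        "model": model,
        "temperature": 0.6,        # the precise-coding preset under test
        "presence_penalty": 0.0,
        "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

# POST this as JSON to the local OpenAI-compatible endpoint
# (e.g. http://localhost:8080/v1/chat/completions) and confirm the reply
# contains a well-formed tool call before starting real work.
payload = json.dumps(build_smoke_test_request())
```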
Risk 3: Visual QA can create false confidence
A deck can look clean while still containing missing speaker-note logic, incorrect numbers, or shallow research.
Mitigation: pair visual QA with content QA and source QA. If the deck is supposed to reflect research, the markdown extracted from the .pptx should still align with the actual evidence base.
Risk 4: Local agent loops still hallucinate authority
Qwen3.6 is benchmark-strong and explicitly agent-oriented. S1 pi is designed to keep working through files, tools, and sessions. S5 Those two facts together create a dangerous side effect: the system can sound operationally mature before it is actually safe to trust.
Mitigation: make every high-value output prove itself through external checks. That includes builds, rendered artifacts, placeholder scans, file diffs, and source-backed claims.
Observability and SLO Model
If you want to run local coding agents seriously, you need an operator contract. Mine would look like this:
| Signal | Target | Why it matters |
|---|---|---|
| First-pass build success | Track, but do not optimize blindly | Low first-pass success is fine if repair success is high |
| Repair success within 2 loops | >= 70% on scoped artifact tasks | Tells you whether the agent can use feedback instead of just generating |
| Placeholder residue | 0 | Fast proxy for unfinished work |
| Expected artifact count | exact match | Prevents silent truncation |
| Visual QA pass rate on checkpoint slides | 100% | Keeps design-sensitive outputs honest |
| Context occupancy warning line | 60% of active working window | Past this, latency and brittleness usually rise |
| Unverified factual claims in final artifact | 0 | Prevents a polished lie from shipping |
That table is deliberately unromantic. You do not have observability if your only telemetry is “the model sounded confident.”
pi already exposes some of the right operator signals in the UI footer: working directory, session name, token/cache usage, cost, context usage, and current model. S5 That is a good start. The missing piece is to connect those interface signals to explicit go/no-go thresholds for artifact work.
Final Take
What happened on April 16, 2026 was not that a local model suddenly became frontier-grade. What happened is more consequential.
A local open-weight stack proved it could carry a real work loop:
- research
- planning
- artifact generation
- build feedback
- visual QA
- structural content QA
- targeted repair
That is enough to change how I think about local agents.
I no longer see them mainly as private autocomplete with a shell. I see them as emerging local workers that are credible on bounded execution loops, provided the loop includes deterministic verification and the operator is disciplined about context, precision, and evidence.
The future is not “the model writes perfect code.” The future is that a local terminal agent can now make a serious first pass at end-to-end knowledge work, including its own QA, on hardware you control.
That future is already here.
It just still needs adult supervision.
