On April 16, 2026, I replaced my earlier local Gemma run with a heavier stack:
```bash
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL --jinja
```
The result was not “a slightly better chat model.” The result was a qualitatively different local agent loop.
This time the agent did not stop at repo reconnaissance, rough planning, or code scaffolding. It wrote a research narrative, generated a slide deck module by module, rebuilt after failures, converted the deck to PDF, rasterized slides for visual inspection, read the resulting PNGs, ran text extraction against the .pptx, checked for placeholder residue, and then closed the loop with targeted repairs. That is not AGI. But it is no longer a toy local demo either.
My April 4 write-up on pi + Gemma 4 argued that the floor for local agentic coding had moved. After this Qwen3.6 run, my view is stronger: the floor has moved again. The important shift is not that the model became perfect. The important shift is that a local open-weight stack can now complete a multi-artifact execution loop with its own QA and repair phases, as long as you constrain it with deterministic checks and refuse to confuse fluency with correctness.
Context and Status on April 16, 2026
The official Qwen3.6 model card positions Qwen3.6-35B-A3B as the first open-weight Qwen3.6 variant, explicitly centered on agentic coding, repository-level reasoning, and “thinking preservation.” It lists 35B total parameters with 3B activated, 256 experts, 8 routed plus 1 shared active experts, and a native 262,144-token context window extensible to about 1.01M with YaRN-style scaling. S1
Architecturally, this is not a standard dense transformer. The model card specifies a hybrid layout: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)). That means 75% of the model’s layers use Gated DeltaNet, a linear attention variant, while only 25% use full quadratic attention. S1 This matters for local inference: linear attention layers are faster and use less memory per token than standard attention, which is part of why a 35B-total model with only 3B activated can run on consumer hardware without collapsing under context pressure.
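The layer arithmetic behind those percentages is easy to check (a quick sanity computation based on the layout quoted above, not code from any official repository):

```python
# Hybrid layout from the model card: 10 repeats of
# (3 × (Gated DeltaNet → MoE) followed by 1 × (Gated Attention → MoE)).
REPEATS = 10
DELTANET_PER_REPEAT = 3
ATTENTION_PER_REPEAT = 1

deltanet_blocks = REPEATS * DELTANET_PER_REPEAT    # 30 linear-attention blocks
attention_blocks = REPEATS * ATTENTION_PER_REPEAT  # 10 quadratic-attention blocks
total = deltanet_blocks + attention_blocks         # 40 blocks overall

assert deltanet_blocks / total == 0.75   # 75% Gated DeltaNet
assert attention_blocks / total == 0.25  # 25% full quadratic attention
```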
That positioning matters because this is not just another instruct checkpoint with a marketing paragraph about “agents.” Qwen is publishing benchmark numbers against coding-agent workloads directly in the model card. On the coding-agent slice, it reports 73.4 on SWE-bench Verified, 51.5 on Terminal-Bench 2.0, 68.7 on Claw-Eval Avg, and 52.6 on QwenClawBench. On tool-use benchmarks specifically, it scores 37.0 on MCPMark and 62.8 on MCP-Atlas. S1 One important caveat: the SWE-bench numbers use Qwen’s own internal agent scaffold with bash and file-edit tools at temp=1.0, top_p=0.95, and a 200K context window. S1 These are self-reported scores, not independently reproduced. They still explain why this model family is materially more relevant to terminal-native agent work than a generic chat release, but treat them as directional evidence, not ground truth for your own stack.
Unsloth’s GGUF release is also not incidental. Their Qwen3.6 page explicitly calls out developer-role support for Codex/OpenCode-style terminals and tool-calling improvements, while publishing a full spread of Dynamic 2.0 quants up through 8-bit. On the 8-bit tier, the page lists Q8_0 at 36.9 GB and UD-Q8_K_XL at 38.5 GB, which is heavy but still realistic on high-memory Apple Silicon or a workstation with enough RAM. S2
That combination is what made the run interesting on April 16, 2026:
- a model family explicitly optimized for agentic coding S1
- a GGUF distribution path designed for local runtimes and tool-calling robustness S2 S7
- llama.cpp exposing an OpenAI-compatible local server with direct Hugging Face fetches via -hf S3
- pi acting as the thin terminal harness that feeds the model files, tools, skills, images, and shell feedback S5
The future did not arrive because one benchmark number moved. It arrived because the composition finally crossed a practical threshold.
The Exact Stack I Ran
The whole stack ran on a MacBook Pro with an M4 Max and 64 GB unified memory. UD-Q8_K_XL at ~38.5 GB fits comfortably with headroom for the OS, llama-server KV cache, and the post-processing toolchain (soffice, magick, markitdown). That headroom matters: you want the agent loop to finish without swapping, not just to load the weights.
The architecture was simple enough to fit in one terminal, which is part of the point:
```mermaid
flowchart LR
    U["User request"] --> P["pi terminal harness"]
    P --> C["AGENTS.md + skills + session tree"]
    P --> T["read / write / edit / bash"]
    P --> S["llama-server"]
    S --> M["Unsloth Qwen3.6 35B A3B GGUF"]
    T --> B["deck build"]
    B --> V["visual QA"]
    V --> R["repair loop"]
    R --> B
```
pi is well suited to exactly this kind of experiment because it stays deliberately thin. Its README describes it as a “minimal terminal coding harness,” with four default tools (read, write, edit, bash), session branching, compaction, AGENTS loading, and on-demand skills. S5 What it explicitly does not include is equally important: no MCP, no sub-agents, no plan mode, no permission popups. S5 Those are all features you can add through TypeScript extensions or pi packages, but the core refuses to bake them in. That design tradeoff matters operationally: less hidden orchestration means less mystery when you are debugging a local agent loop.
llama.cpp is the second critical layer. The project documents llama-server -hf <repo> as a direct path to launching an OpenAI-compatible local API, and explicitly frames GGUF + quantized inference as part of its normal operating model. S3 For this run, that meant I could point a terminal agent at a local endpoint without rewriting my harness around a custom runtime protocol.
The third layer is the model-specific template path. Qwen-family chat formatting is not a side issue. The llama.cpp template wiki explains that it reads tokenizer.chat_template metadata and includes a built-in Jinja parser, minja, to apply those templates. S4 That is exactly the category of detail that separates “the model loads” from “the agent behaves correctly over multiple turns.”
Finally, the post-processing stack mattered:
```bash
node build.cjs
soffice --headless --convert-to pdf deck.pptx --outdir qa
magick -density 200 qa/deck.pdf -resize 1920x1080 qa/slide-%02d.png
python3 -m markitdown deck.pptx | grep -Ei "xxx|lorem|placeholder|TODO"
```
That last step is more important than it looks. Microsoft positions MarkItDown as a document-to-Markdown utility for LLM and text-analysis pipelines, and explicitly lists PowerPoint among the supported formats while warning that it is meant for structural text extraction rather than high-fidelity presentation conversion. S6 In other words, it is exactly right for content QA and exactly wrong for visual publishing. That makes it a perfect gate in a self-healing agent loop.
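The same grep gate can live in Python, which makes it easier to reuse inside a harness extension. A minimal sketch (the helper names are mine; the pattern list mirrors the grep in the pipeline above):

```python
import re

# Patterns that indicate unfinished content, mirroring the grep above.
PLACEHOLDER_RE = re.compile(r"xxx|lorem|placeholder|TODO", re.IGNORECASE)

def placeholder_residue(markdown: str) -> list[str]:
    """Return every extracted line that still looks like placeholder text."""
    return [line for line in markdown.splitlines() if PLACEHOLDER_RE.search(line)]

def content_gate_passes(markdown: str) -> bool:
    """Deterministic go/no-go: zero placeholder lines means the gate passes."""
    return not placeholder_residue(markdown)
```

In the real loop, `markdown` would be the output of `python3 -m markitdown deck.pptx`.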
Why Qwen3.6 Changed the Result
The short answer is that Qwen3.6 looks like a model trained with this use case in mind.
The longer answer is that several details stack on top of each other:
| Layer | Relevant fact | Why it changed the run |
|---|---|---|
| Qwen3.6 core model | 35B total, 3B activated, hybrid DeltaNet/attention architecture, 262K native context S1 | Linear attention layers reduce memory pressure; MoE activation keeps compute manageable on local hardware |
| Qwen3.6 API semantics | Preserved thinking traces in agent scenarios reduce “redundant reasoning” and improve “decision consistency” S1 | Multi-step execution stays coherent because the model reuses prior reasoning instead of re-deriving it each turn |
| Unsloth GGUF release | Developer-role support and tool-calling improvements are called out directly S2 | The local terminal experience becomes less brittle when role and tool conventions line up |
| Dynamic 2.0 quantization | Unsloth selectively quantizes every layer using per-layer KL divergence minimization, with a >1.5M-token calibration dataset tuned for chat S7 | The quant is not just smaller; it is actively minimizing behavioral drift from the full-precision model |
| llama.cpp server path | OpenAI-compatible serving plus direct HF fetch keeps the stack boring S3 | The harness can focus on task execution instead of runtime glue |
What I observed on top of that source-backed stack was this:
- the agent could keep a narrative plan in working memory long enough to finish a 16-slide deck
- it could accept runtime feedback without collapsing into generic apology mode
- it could move from generation to verification to repair without external handholding at every micro-step
- it could treat rendered images as part of the task surface rather than as an afterthought
That last point is subtle. The Qwen3.6 model card is for an image-text-to-text model with a vision encoder. S1 llama.cpp also now documents multimodal support in llama-server. S3 I am not claiming a full benchmarked local multimodal stack from one session. I am saying the workflow was operationally multimodal enough to inspect rendered slides and keep moving. That is already strategically important.
The Loop That Crossed the Threshold
The most important part of the run was not the initial generation. It was the closed loop.
From the session log, the agent executed a sequence that looked like this:
- Plan the slide deck structure and create slide modules.
- Write the modules to disk, one file at a time.
- Rebuild the deck.
- Notice concrete defects in generated code and patch them.
- Convert the .pptx to PDF.
- Rasterize the PDF to slide images.
- Read slide PNGs back into the agent for visual QA.
- Extract markdown from the .pptx and grep for placeholder residue.
- Confirm all slides were extracted and then summarize the QA outcome.
That is the real story. Not “the model felt smarter.” The model stayed inside a deterministic artifact loop long enough to make itself useful.
Three specific moments from the session are worth calling out.
1. It repaired structural code defects instead of restarting from scratch
The run did not succeed on the first build. It patched a missing LAYOUT import in 02_exec_summary.cjs, then repaired table header definitions in 05_five_levels.cjs, 10_open_source_tools.cjs, and 14_sdlc_coverage.cjs, rebuilt the deck, and continued. That is a materially better failure mode than the one I saw in the Gemma 4 run, where exactness degraded earlier and more visibly on layout-sensitive work.
The tradeoff is that the repair was still local and tactical. It fixed what the current build surfaced. It did not prove global correctness. The mitigation is to keep the loop instrumented with explicit gates instead of trusting the agent’s own confidence language.
2. It treated visual QA as part of execution, not as a human-only afterthought
After rebuilding the deck, the agent converted the .pptx to PDF, turned the PDF into slide PNGs, listed the generated images, and then read them back one by one. That is the right pattern for agentic artifact work: if the output is visual, the verification surface must also be visual.
The tradeoff is latency. Visual QA turns one artifact into multiple secondary artifacts, and every one of those files consumes wall-clock time, tokens, or both. The mitigation is to reserve visual QA for high-value checkpoints: title slide, dense table slide, visual-summary slide, and final pass.
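That checkpoint policy is small enough to encode directly. A sketch (hypothetical helper, assuming 1-indexed slide numbers):

```python
def visual_qa_checkpoints(slide_count: int, dense_table_slides: set[int]) -> list[int]:
    """Pick the slides worth rasterizing and re-reading: the title slide,
    layout-sensitive table slides, and the final slide. Everything else
    is covered by the cheaper structural-text pass."""
    checkpoints = {1, slide_count}
    checkpoints |= {s for s in dense_table_slides if 1 <= s <= slide_count}
    return sorted(checkpoints)
```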
3. It used text extraction as a second QA channel
The python3 -m markitdown ... | grep ... step is exactly the kind of small deterministic guardrail that local agents need. MarkItDown is built for structure-preserving text extraction, not pixel-perfect rendering. S6 That makes it a strong way to catch placeholder text, missing sections, or broken speaker-note content even when the visual layer looks superficially clean.
The tradeoff is that content QA can still pass while the slide remains ugly. The mitigation is to pair structural text QA with image-based inspection instead of pretending one can replace the other.
This is why I call the session a threshold event. The agent did not merely generate. It generated, verified, repaired, and verified again.
What Was Different From My Gemma 4 Run
My April 4 Gemma 4 write-up argued that the first things to break under a local quantized agent were exactness, path resolution, spatial arithmetic, and long-context responsiveness. That still broadly holds. What changed with Qwen3.6 was where the system could still recover.
Here is the most honest comparison I can make from the two sessions:
| Observed dimension | April 4: Gemma 4 26B A4B | April 16: Qwen3.6 35B A3B |
|---|---|---|
| Architecture | Dense transformer, standard attention | Hybrid DeltaNet (75%) + attention (25%), MoE |
| Best behavior | Reconnaissance, planning, skill use | Multi-step artifact loop with QA and repair |
| Failure character | Exactness broke early | Exactness still failed, but recovery was stronger |
| Operational ceiling | Useful first 70 percent | Usable near-complete worker with guardrails |
| Main risk | Silent precision decay | Latency, context pressure, and false confidence |
That does not mean Qwen3.6 “beats” every model on every task. It means this particular model + quant + runtime + harness combination crossed into a more serious class of local work.
Implementation Patterns I Would Keep
If I were productizing this workflow instead of celebrating a one-off run, I would keep these three patterns.
Pattern 1: Artifact loops must have at least two independent QA surfaces
For a slide deck, that means one visual surface and one structural-text surface.
- visual surface: rendered slide images
- structural-text surface: markdown extraction plus placeholder scanning
If both pass, the artifact is much more trustworthy than if the model merely says “looks good.”
Pattern 2: Let the model own repairs, but never let it define completion
The agent should absolutely be allowed to patch missing imports, malformed headers, and broken table definitions. That is where it creates leverage.
But the completion contract must stay external:
- build exits cleanly
- no placeholder residue
- expected slide count present
- key slides pass visual inspection
Without those gates, “self-healing” degenerates into “self-narrating.”
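A minimal sketch of that external contract (all names are hypothetical; the field values would come from the build exit code, the markitdown scan, and the image pass):

```python
from dataclasses import dataclass

@dataclass
class LoopState:
    build_exit_code: int        # from the deck build
    placeholder_lines: int      # from the markitdown + grep pass
    slides_found: int           # slides actually rasterized/extracted
    slides_expected: int        # slides the plan promised
    checkpoint_visual_pass: bool  # key slides passed image inspection

def run_is_complete(s: LoopState) -> bool:
    """Completion is defined outside the model: every gate must hold."""
    return (
        s.build_exit_code == 0
        and s.placeholder_lines == 0
        and s.slides_found == s.slides_expected
        and s.checkpoint_visual_pass
    )
```

The agent may repair its way toward this state, but it never gets to redefine it.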
Pattern 3: Context is a budget, not a trophy
By the end of the run, pi’s UI footer showed context usage at 64.2%/128k. That is already enough to change the feel of a local session. It is not the theoretical max context from the model card. It is the working budget the operator feels.
Qwen3.6 officially supports 262,144 tokens natively and can stretch further with YaRN-style scaling. S1 But the same model card also recommends using long-context scaling only when you actually need it, and separately advises maintaining “a context length of at least 128K tokens to preserve thinking capabilities.” S1 That is Qwen explicitly telling you that thinking mode has a minimum context budget. Cut the window too aggressively and you lose the reasoning behavior that makes the model useful for agent work.
My operational rule is simple: treat any local session above roughly 60% working-window occupancy as a warning zone unless you have strong compaction or a staged task plan.
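That rule is trivial to encode (a sketch with the 60% threshold from the rule above as the default):

```python
def context_zone(tokens_used: int, working_window: int, warn_frac: float = 0.60) -> str:
    """Classify session context pressure against the active working window,
    not the model card's theoretical maximum."""
    occupancy = tokens_used / working_window
    return "warning" if occupancy >= warn_frac else "ok"
```

The 64.2% of 128k that pi's footer reported at the end of this run would already land in the warning zone.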
Risks and Mitigations
This is where the article stops being impressed and starts being useful.
Risk 1: Heavy quants make the model better, but not cheap
UD-Q8_K_XL is a high-fidelity choice, and that is exactly why it is large. Unsloth’s page places it in the high-30 GB range. S2 That is acceptable on some machines and ridiculous on others.
Mitigation: choose the quant based on the task’s failure mode, not on abstract benchmark envy. If the workload is artifact-heavy and repair-sensitive, pay for precision. If it is simple recon or rough drafting, drop lower.
Risk 2: Template correctness and sampling parameters are real dependencies
Qwen-family behavior depends on chat-template handling. llama.cpp is explicit that template application is part of inference plumbing, and that it ships a Jinja parser (minja) for this path. S4 The `--jinja` flag in my launch command is not decorative; it activates the Jinja parser that correctly processes Qwen3.6's `<think>` blocks and tool-call formatting.
Sampling parameters also matter more than they look. Qwen’s model card recommends distinct parameter sets for different modes: temperature=0.6, presence_penalty=0.0 for precise coding tasks versus temperature=1.0, presence_penalty=1.5 for general tasks. S1 Using the wrong parameters for your workload can produce noticeably different failure rates.
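A small helper keeps those presets from drifting between sessions (a sketch; the parameter values mirror the model card's recommendations quoted above, while the function name and the "default to the precise preset" choice are mine):

```python
# Parameter sets per the model card's recommendations (S1); treat the
# exact values as the card's guidance, not universal constants.
SAMPLING_PRESETS = {
    "coding": {"temperature": 0.6, "presence_penalty": 0.0},
    "general": {"temperature": 1.0, "presence_penalty": 1.5},
}

def sampling_params(task_kind: str) -> dict:
    """Pick the preset matching the workload; fall back to the precise one."""
    return SAMPLING_PRESETS.get(task_kind, SAMPLING_PRESETS["coding"])
```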
Mitigation: do not treat template flags, role formatting, or sampling parameters as cosmetic. Test a short tool-use conversation with your exact parameter set before you trust a long coding session.
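One way to run that smoke test is to hand-build a one-turn tool-use request for the OpenAI-compatible endpoint llama-server exposes. A sketch (the tool schema, function name, and port are illustrative stand-ins, not from any official documentation):

```python
import json

def build_smoke_test_request(model: str = "local") -> dict:
    """One-turn tool-use request to verify template and sampling behavior
    before trusting a long coding session. The weather tool is a stand-in."""
    return {
        "model": model,
        "temperature": 0.6,        # the precise-coding preset under test
        "presence_penalty": 0.0,
        "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

# POST this as JSON to the local OpenAI-compatible endpoint
# (e.g. http://localhost:8080/v1/chat/completions) and confirm the reply
# contains a well-formed tool call before starting real work.
payload = json.dumps(build_smoke_test_request())
```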
Risk 3: Visual QA can create false confidence
A deck can look clean while still containing missing speaker-note logic, incorrect numbers, or shallow research.
Mitigation: pair visual QA with content QA and source QA. If the deck is supposed to reflect research, the markdown extracted from the .pptx should still align with the actual evidence base.
Risk 4: Local agent loops still hallucinate authority
Qwen3.6 is benchmark-strong and explicitly agent-oriented. S1 pi is designed to keep working through files, tools, and sessions. S5 Those two facts together create a dangerous side effect: the system can sound operationally mature before it is actually safe to trust.
Mitigation: make every high-value output prove itself through external checks. That includes builds, rendered artifacts, placeholder scans, file diffs, and source-backed claims.
Observability and SLO Model
If you want to run local coding agents seriously, you need an operator contract. Mine would look like this:
| Signal | Target | Why it matters |
|---|---|---|
| First-pass build success | Track, but do not optimize blindly | Low first-pass success is fine if repair success is high |
| Repair success within 2 loops | >= 70% on scoped artifact tasks | Tells you whether the agent can use feedback instead of just generating |
| Placeholder residue | 0 | Fast proxy for unfinished work |
| Expected artifact count | exact match | Prevents silent truncation |
| Visual QA pass rate on checkpoint slides | 100% | Keeps design-sensitive outputs honest |
| Context occupancy warning line | 60% of active working window | Past this, latency and brittleness usually rise |
| Unverified factual claims in final artifact | 0 | Prevents a polished lie from shipping |
That table is deliberately unromantic. You do not have observability if your only telemetry is “the model sounded confident.”
pi already exposes some of the right operator signals in the UI footer: working directory, session name, token/cache usage, cost, context usage, and current model. S5 That is a good start. The missing piece is to connect those interface signals to explicit go/no-go thresholds for artifact work.
Final Take
What happened on April 16, 2026 was not that a local model suddenly became frontier-grade. What happened is more consequential.
A local open-weight stack proved it could carry a real work loop:
- research
- planning
- artifact generation
- build feedback
- visual QA
- structural content QA
- targeted repair
That is enough to change how I think about local agents.
I no longer see them mainly as private autocomplete with a shell. I see them as emerging local workers that are credible on bounded execution loops, provided the loop includes deterministic verification and the operator is disciplined about context, precision, and evidence.
The future is not “the model writes perfect code.” The future is that a local terminal agent can now make a serious first pass at end-to-end knowledge work, including its own QA, on hardware you control.
That future is already here.
It just still needs adult supervision.
