Running Local LLMs on Microsoft Surface Pro 11: NPU Acceleration with DeepSeek Models

Introduction

The computing industry is witnessing a paradigm shift with the integration of dedicated AI accelerators in consumer devices. Microsoft’s Copilot+ PCs, including the Surface Pro 11, represent a strategic investment in on-device AI processing. The ability to run sophisticated Large Language Models (LLMs) locally, without constant cloud connectivity, offers compelling advantages in privacy, latency, and offline functionality. This report investigates the practical aspects of developing and deploying local LLMs on the Snapdragon X Elite-based Surface Pro 11 with 32 GB of RAM, focusing on Neural Processing Unit (NPU) acceleration through ONNX Runtime and the deployment of the DeepSeek R1 7B and 14B distilled models. By examining the developer experience and performance characteristics, and comparing them with Apple’s M4 silicon, we aim to provide a comprehensive picture of the current state and future potential of on-device AI processing. ...
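
For orientation, here is a minimal sketch of how an ONNX model can be pointed at the Snapdragon NPU via ONNX Runtime's QNN execution provider. The model path is a placeholder, and the exact provider options depend on your onnxruntime-qnn build; treat this as an assumption-laden starting point, not the post's full setup:

```python
import onnxruntime as ort

# Placeholder path to an ONNX-exported DeepSeek R1 distilled model.
model_path = "deepseek-r1-distill-qwen-7b.onnx"

# Request the Qualcomm QNN execution provider (the HTP backend drives the NPU),
# with the CPU provider as a fallback for any unsupported operators.
session = ort.InferenceSession(
    model_path,
    providers=[
        ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
        "CPUExecutionProvider",
    ],
)

# Verify which providers were actually assigned to the session.
print(session.get_providers())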

May 12, 2025 · 18 min

Ollama Minions: Merging Local and Cloud LLMs for Next-Gen Efficiency

TL;DR

Ollama Minions is a framework that orchestrates hybrid local-cloud inference for large language models. Instead of sending an entire, possibly massive document to a cloud model for processing (which can be prohibitively expensive and raises privacy concerns), Minions lets your local model, for instance Llama 3.2 running on your own machine, handle most of the input. The cloud model (such as GPT-4o) is called upon only when advanced reasoning is required, keeping API usage and the associated costs to a minimum. ...
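
As a rough illustration of the routing idea (not the Minions API itself), the sketch below uses the `ollama` and `openai` Python clients: the local model condenses each chunk on-device, and a single cloud call reasons over the condensed notes. The model names, prompt wording, and chunking are illustrative assumptions:

```python
import ollama                # local Llama 3.2 served by Ollama
from openai import OpenAI    # cloud GPT-4o; assumes OPENAI_API_KEY is set

client = OpenAI()

def condense_locally(question: str, chunk: str) -> str:
    # The raw chunk never leaves the machine; only this summary might.
    resp = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user",
                   "content": f"Extract only the facts relevant to: {question}\n\n{chunk}"}],
    )
    return resp["message"]["content"]

def answer(document: str, question: str, chunk_size: int = 4000) -> str:
    # 1) Local pass: the on-device model reads every chunk of the document.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    notes = "\n".join(condense_locally(question, c) for c in chunks)
    # 2) Cloud pass: one small GPT-4o call over the condensed notes.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Using these notes, answer: {question}\n\n{notes}"}],
    )
    return resp.choices[0].message.content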

March 7, 2025 · 7 min