Report #100045
[frontier] Can I run multi-modal agents locally on phones or laptops without the cloud?
Yes for narrow perception tasks: run 3-4B vision/language models on the NPU for screenshot OCR, icon classification, and lightweight grounding; keep the cloud model for complex planning and multi-step reasoning. Use platform frameworks (Apple Foundation Models, Google AI Core) to avoid shipping your own model weights.
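A minimal sketch of that edge/cloud split, assuming the app can label each request with a task kind and a rough step count. The names `Task`, `Target`, `EDGE_TASKS`, and `route` are hypothetical and not part of any platform framework; the actual model calls are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    EDGE = "edge"    # on-device model running on the NPU
    CLOUD = "cloud"  # hosted model for planning and multi-step reasoning


# Hypothetical task labels, mirroring the perception tasks named above.
EDGE_TASKS = frozenset(
    {"screenshot_ocr", "icon_classification", "lightweight_grounding"}
)


@dataclass(frozen=True)
class Task:
    kind: str       # e.g. "screenshot_ocr" or "gui_planning"
    steps: int = 1  # rough estimate of reasoning steps needed


def route(task: Task, edge_available: bool = True) -> Target:
    """Send narrow, single-step perception to the edge; everything else to the cloud."""
    if edge_available and task.kind in EDGE_TASKS and task.steps == 1:
        return Target.EDGE
    return Target.CLOUD
```

Falling back to the cloud whenever the on-device model is unavailable keeps the same code path working on older hardware, at the cost of a network round trip.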
Journey Context:
By 2026, flagship phones ship NPUs with 45-75 TOPS and enough RAM to run 3-7B parameter models locally. Apple Intelligence's Foundation Models framework and Google's AI Core/Gemini Nano expose on-device multimodal APIs. The viable pattern is not 'run GPT-5 on a phone' but 'route routine perception to the edge and hard reasoning to the cloud'. The mistake is trying to fit a full computer-use agent (CUA) onto a device: local models are fine for classifying a screenshot or extracting text, but terrible at open-ended GUI planning. Quantization can also cost roughly 5-15% in quality, so validate the quantized model on real prompts from your own workload before shipping it.
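One way to check the quantization cost on your own prompts is to score the baseline and the quantized model against the same expected answers and compare. This is a rough sketch: the helper names `accuracy` and `relative_drop` are hypothetical, each model is assumed to be wrapped as a plain `str -> str` callable, and exact-match scoring is an assumption that only suits short outputs such as labels or extracted text.

```python
from typing import Callable, Sequence


def accuracy(
    model: Callable[[str], str],
    prompts: Sequence[str],
    expected: Sequence[str],
) -> float:
    """Fraction of prompts whose output matches the expected answer.

    Matching is exact after stripping whitespace and lowercasing.
    """
    if not prompts or len(prompts) != len(expected):
        raise ValueError("prompts and expected must be non-empty and equal length")
    hits = sum(
        model(p).strip().lower() == e.strip().lower()
        for p, e in zip(prompts, expected)
    )
    return hits / len(prompts)


def relative_drop(baseline: float, quantized: float) -> float:
    """Quality lost by the quantized model, as a fraction of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - quantized) / baseline
```

A result above the range you are willing to accept is a signal to try a less aggressive quantization level or to keep that task on the cloud model.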
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:29:29.890540+00:00— report_created — created