Building Offline-First React Apps with Local AI
Whisper.cpp, Llama.cpp, on-device inference, and the React patterns behind an app that ships AI without a server.
For most of the last decade, "AI in your app" meant "API call to someone else's GPU." OpenAI, Anthropic, Cohere, a half-dozen others. Your code sent a prompt; their data centre sent a response. Simple, expensive, network-dependent, and privacy-fraught.
That's no longer the only option.
A Whisper-large model that needed an A100 in 2022 runs on a MacBook Pro in 2026. A 7-billion-parameter Llama model fits in 4GB of VRAM with quantisation. Apple, Google, and Microsoft ship transformer-class models with their operating systems. The browser runs ONNX and WebGPU inference natively. Ollama and LM Studio make local models a one-line install.
Local AI isn't a thought experiment anymore. It's a deployment target. And for a meaningful set of apps, it's the right deployment target.
This post is the practical view of how a React engineer builds a local-AI app today. The architecture, the model lifecycle, the React patterns, the latency budget, the offline-first sync story when you also have a cloud, and the tradeoffs that matter.
Why Local AI, Why Now
Three things shifted between 2024 and 2026.
Hardware got cheap. A consumer MacBook Air with 16GB of unified memory runs a 7B-parameter quantised LLM at conversational speed. An iPhone runs Whisper-tiny faster than the network round-trip to a cloud transcriber. Apple Silicon, Snapdragon X Elite, and the next round of AI PCs all ship with NPUs that target this exact workload.
Models got smaller. Distillation, quantisation, and architectural improvements have pushed much of what used to need a 70B-class model into 7B-class footprints. Phi-3, Llama 3.2, Gemma 2, Mistral Nemo. The Pareto frontier of "quality per gigabyte" improves every quarter.
Runtimes got serious. llama.cpp runs on Metal, CUDA, and CPU. whisper.cpp does the same for transcription. ONNX Runtime Web ships with WebGPU acceleration. transformers.js runs models entirely in the browser. Ollama wraps llama.cpp in a daemon with an OpenAI-compatible HTTP API.
The trio means you can ship an app that runs real AI on the user's machine, without a server, without a cloud bill, without sending data anywhere. For some apps that's a nice feature. For others (legal, medical, regulated industries, anything with HIPAA or GDPR boundaries) it's the only viable shape.
What "Local-First AI" Actually Means
The local-first label has been around for a while in the database community (Ink & Switch, Riffle, instant.db, Replicache). Applied to AI, three properties:
- Inference runs on the user's hardware. A model file is on disk. A runtime loads it. The app calls into the runtime.
- The app keeps working offline. No network round-trip is required for the core AI feature. The cloud is optional.
- User data stays local by default. Cloud sync is opt-in, not the default. The app doesn't need to phone home to function.
A "local AI" app violates one or more of these. A "local-first AI" app holds all three. The architectural difference is significant. So is the engineering effort.
Who's Already Doing This?
A few apps building in this space:
- Ollama is the de-facto open-source local LLM runtime. One install, dozens of models, OpenAI-compatible HTTP API on localhost:11434.
- LM Studio is the GUI equivalent: download a model, chat with it, no terminal required.
- Whisper.cpp runs Whisper transcription locally with CPU or Metal/CUDA acceleration.
- Apple's Foundation Models framework (announced at WWDC 2025, building on the Apple Intelligence models introduced at WWDC 2024) exposes on-device LLMs through native APIs.
- Pieces for Developers ships a local LLM for code-context understanding.
- Hyperdimensional, Witsy, Jan, Open WebUI are open-source desktop chat clients running local models.
- VS Code's "AI Toolkit" and Continue.dev support local models alongside cloud ones.
- transformers.js (Hugging Face, originally by Xenova) runs Whisper, sentence-transformers, BERT, and small generative models in the browser via WebAssembly or WebGPU.
- MLC LLM compiles models for WebGPU + browsers, with Llama-class quality in a tab.
The trajectory: every category of AI tooling now has a credible local option. Most have a serious one.
The Anatomy of a Local-AI React App
Strip the framework names away and the shape is consistent across apps that ship local inference.

The local runtimes live below the orchestrator, isolated from the renderer. The renderer never imports them directly (they're native, they can't run in a Chromium tab). Cloud sync and cloud inference are dashed lines: explicitly optional, gated by user consent.
That separation is what makes the architecture local-first. The cloud isn't load-bearing. It's a feature.
Think of It Like Having a Library in Your House
A neighbourhood library is convenient. Free, vast, but closed on Sundays and dependent on a working car or bus.
A library in your house is different. You walk in any time. No card. No internet. The selection is smaller, but for the books you read often, it's faster and always available. You can borrow more from the neighbourhood library when you need to, and bring back the ones you don't.
- The books on your shelves are your local models.
- The librarian in your house is the runtime that loads them.
- The neighbourhood library is the cloud.
- The borrowing system is your sync layer.
Local-first doesn't mean "no library." It means "you have your own shelves, and the library is one of several ways to find a book." For the apps where that mental model fits, local AI is the right default.
Pillar 1: Where the Inference Runs
Four credible places to run a model in a React app. Each has a different shape.

Each option trades a different axis: portability vs power, simplicity vs control, lock-in vs convenience.
In the browser via WebGPU and transformers.js. Smallest, fastest to ship. Limited by the model sizes that fit in browser memory (under 2GB practically). Good for: small whisper variants, sentence-transformers, distil-BERT class summarisers, embedding models.
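// Transformers.js v3 is published as @huggingface/transformers; v2 was @xenova/transformers.
import { pipeline } from "@huggingface/transformers";
const transcribe = await pipeline("automatic-speech-recognition", "Xenova/whisper-tiny.en");
// `setTranscript` is the state setter of whichever component owns the transcript.
async function onAudio(audio: Float32Array) {
  const result = await transcribe(audio);
  setTranscript(result.text);
}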
In Electron main via native bindings. Bigger models, faster inference, full access to Metal/CUDA. The renderer talks via IPC. Good for: 7B+ LLMs, whisper-large, anything that benefits from native acceleration.
// Electron main, with a Node binding to llama.cpp.
// This is the node-llama-cpp v2 constructor API; v3 loads models through getLlama() instead.
import { ipcMain } from "electron";
import { LlamaModel, LlamaContext, LlamaChatSession } from "node-llama-cpp";
const model = new LlamaModel({ modelPath: "/path/to/llama-3.2-3b.Q4_K_M.gguf" });
const ctx = new LlamaContext({ model });
const session = new LlamaChatSession({ context: ctx });
ipcMain.handle("ai:chat", async (_event, prompt: string) => {
  return session.prompt(prompt);
});
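As a sidecar process (Ollama). Bundle Ollama with your app and talk to it over localhost:11434. Ollama exposes its native API under /api (used below; with stream: true it returns newline-delimited JSON) and an OpenAI-compatible API under /v1. Simplest integration. Cleanest separation. Slightly heavier disk footprint.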
const response = await fetch("http://localhost:11434/api/chat", {
method: "POST",
body: JSON.stringify({
model: "llama3.2",
messages: [{ role: "user", content: prompt }],
stream: true,
}),
});
Via OS-level APIs (Apple Foundation Models, Windows Copilot Runtime). Newest, most platform-specific. The OS owns the model. Your app calls a system API. Free of model-distribution headaches but ties you to the platform's roadmap.
Most production apps end up with two of these. The browser path for fast, small features (transcription, embeddings). The native path or Ollama for the heavy LLM work. Sometimes the OS path on the platforms that support it.
Pillar 2: Model Lifecycle
A local model is a heavy artefact: hundreds of megabytes to several gigabytes. Treating it like a regular asset breaks every assumption a normal asset pipeline has.
The lifecycle has six stages: manifest, download, verify, cache, load, unload.

Each stage matters.
Manifest first. Don't hardcode URLs. Ship a manifest that describes which models the app needs, where to find them, how big they are, and how to verify them. The app reads the manifest at startup and reconciles with what's on disk.
type ModelManifest = {
models: ModelEntry[];
};
type ModelEntry = {
id: string;
name: string;
family: "whisper" | "llama" | "embedding";
sizeBytes: number;
sha256: string;
url: string;
format: "gguf" | "onnx" | "safetensors";
context?: number;
parameters?: string;
};
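The reconcile step can be pictured as a pure function over the manifest and whatever the on-disk index reports. This is a minimal sketch, not the post's implementation: `ManifestModel`, `InstalledModel`, `ReconcilePlan`, and `reconcile` are hypothetical names, and it assumes the index records the hash each file was verified against.

```typescript
// Hypothetical shapes for illustration; only the fields the reconcile step needs.
type ManifestModel = { id: string; sha256: string; sizeBytes: number };
type InstalledModel = { id: string; sha256: string };

type ReconcilePlan = {
  toDownload: ManifestModel[]; // in the manifest, missing or outdated on disk
  toRemove: InstalledModel[]; // on disk, no longer in the manifest
  upToDate: string[]; // ids whose on-disk hash matches the manifest
};

function reconcile(manifest: ManifestModel[], installed: InstalledModel[]): ReconcilePlan {
  const onDisk = new Map(installed.map((m) => [m.id, m] as const));
  const wanted = new Set(manifest.map((m) => m.id));

  const plan: ReconcilePlan = { toDownload: [], toRemove: [], upToDate: [] };
  for (const model of manifest) {
    const local = onDisk.get(model.id);
    if (local && local.sha256 === model.sha256) plan.upToDate.push(model.id);
    else plan.toDownload.push(model); // new model, or the hash changed upstream
  }
  for (const local of installed) {
    if (!wanted.has(local.id)) plan.toRemove.push(local);
  }
  return plan;
}
```

Treat `toRemove` as a list of suggestions to show the user rather than files to delete automatically.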
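Resumable downloads. A 4GB model download will fail partway for some users. Use HTTP Range requests so a partial download can resume:
import { open, stat } from "node:fs/promises";

async function downloadResumable(url: string, dest: string, onProgress: (pct: number) => void) {
  const head = await fetch(url, { method: "HEAD" });
  const total = Number(head.headers.get("content-length"));
  if (!head.ok || !Number.isFinite(total) || total <= 0) throw new Error(`cannot size ${url}`);

  // Bytes already on disk from an earlier attempt (0 if the file does not exist).
  let downloaded = await stat(dest).then((s) => s.size, () => 0);
  if (downloaded === total) return dest;
  if (downloaded > total) downloaded = 0; // stale or corrupt partial: start over

  const res = await fetch(url, {
    headers: downloaded > 0 ? { Range: `bytes=${downloaded}-` } : {},
  });
  if (!res.ok || !res.body) throw new Error(`download failed: HTTP ${res.status}`);
  // A 200 (not 206) means the server ignored the Range header and is sending the whole file.
  if (res.status !== 206) downloaded = 0;

  const file = await open(dest, downloaded > 0 ? "a" : "w");
  try {
    const reader = res.body.getReader();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      await file.write(value);
      downloaded += value.byteLength;
      onProgress(downloaded / total);
    }
  } finally {
    await file.close(); // close the handle even if the stream errors midway
  }
  return dest;
}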
Verify after every download. A truncated model file silently produces garbage. SHA-256 the bytes against the manifest entry before unblocking the rest of the app.
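One way to do that check without reading a multi-gigabyte file into memory is to stream it through Node's built-in hasher. A minimal sketch, assuming the manifest stores the digest as a hex string; `sha256OfFile` is a hypothetical helper name, and `verifySha256` mirrors the call shape used by the model cache later in the post.

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

// Streams the file through SHA-256 so the whole model never sits in memory at once.
async function sha256OfFile(filePath: string): Promise<string> {
  const hash = createHash("sha256");
  for await (const chunk of createReadStream(filePath)) {
    hash.update(chunk as Buffer);
  }
  return hash.digest("hex");
}

// Returns true only when the file's digest matches the expected hex digest.
async function verifySha256(filePath: string, expectedHex: string): Promise<boolean> {
  const actual = await sha256OfFile(filePath);
  return actual === expectedHex.trim().toLowerCase();
}
```

If the check fails, deleting the file before retrying avoids resuming on top of corrupt bytes.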
Memory-map, don't load. Modern runtimes (llama.cpp does this by default) memory-map the weights file. The OS pages weights in and out on demand. You don't allocate the whole file in heap.
Unload on quit (or after idle). Models hold gigabytes. If the user closes the app or hasn't used the model in a while, release it. Lazy-reload on next use.
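class ModelHandle {
  private model: LlamaModel | null = null;
  private lastUsedAt = 0;
  private inFlight = 0; // calls currently using the model
  private readonly idleMs = 5 * 60 * 1000;
  constructor(private readonly path: string) {}
  async use<T>(fn: (m: LlamaModel) => Promise<T>): Promise<T> {
    if (!this.model) this.model = new LlamaModel({ modelPath: this.path });
    this.inFlight++;
    try {
      return await fn(this.model);
    } finally {
      this.inFlight--;
      this.lastUsedAt = Date.now(); // the idle clock starts when the call finishes
    }
  }
  maybeUnload() {
    // Never dispose a model that is mid-inference.
    if (this.model && this.inFlight === 0 && Date.now() - this.lastUsedAt > this.idleMs) {
      this.model.dispose();
      this.model = null;
    }
  }
}
const llamaHandle = new ModelHandle("/path/to/llama-3.2-3b.Q4_K_M.gguf");
setInterval(() => llamaHandle.maybeUnload(), 60_000);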
The lifecycle is more elaborate than a normal asset's. It's also the single biggest source of "the app is weird today" bugs in production local-AI apps. Get it right early.
Pillar 3: Model Selection
Local models live on a pareto curve. Quality goes up; size, latency, memory all go up alongside. Picking the right point matters more than picking the latest model.
Practical buckets, as of 2026:
| Use case | Model size | Storage | Inference latency (M3 Mac) |
|---|---|---|---|
| Embedding (semantic search) | 80-300M params | 80-400 MB | 5-20 ms per chunk |
| Whisper transcription | 39-1550M params | 80 MB to 3 GB | 0.5-3x realtime |
| Small LLM (Q&A, classification) | 1-3B params | 500 MB to 2 GB | 30-80 tok/s |
| General LLM (chat, code) | 7-14B params | 4-8 GB | 15-40 tok/s |
| Large LLM (high quality) | 30-70B params | 18-40 GB | 5-15 tok/s |
For most apps, the 1-3B tier is the sweet spot. Fast enough to feel instant, small enough to ship, smart enough for routing, classification, summarisation, and simple chat. Reach for 7B+ only when the smaller model isn't good enough at the task.
A few quantisation rules that hold up:
- Q4_K_M is the default. Roughly 4 to 5 bits per weight, minimal quality loss, and about a third of the memory of FP16.
- Q8 is for cases where quality matters more than memory.
- Q2/Q3 are interesting only at the 30B+ size, where the alternative is "doesn't fit."
The runtime catalogues (Hugging Face's GGUF collection, Ollama's library, TheBloke's quantised models) make this menu-driven, not research-driven.
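The storage column in the table above follows from simple arithmetic: parameters times bits per weight, divided by eight. A rough sketch of that calculation follows; the bits-per-weight figures are approximate round numbers assumed for illustration, and the result covers weights only, not the KV cache or runtime overhead.

```typescript
// Approximate bits per weight for common GGUF quantisations (assumed round figures).
const BITS_PER_WEIGHT = { FP16: 16, Q8_0: 8.5, Q4_K_M: 4.8 } as const;
type Quant = keyof typeof BITS_PER_WEIGHT;

// Size of the weights alone: parameters × bits per weight ÷ 8 bits per byte.
function weightBytes(params: number, quant: Quant): number {
  return (params * BITS_PER_WEIGHT[quant]) / 8;
}

function toGB(bytes: number): string {
  return (bytes / 1e9).toFixed(1) + " GB";
}

// A 7B-parameter model at each precision.
for (const quant of Object.keys(BITS_PER_WEIGHT) as Quant[]) {
  console.log(quant, toGB(weightBytes(7e9, quant)));
}
// FP16 14.0 GB
// Q8_0 7.4 GB
// Q4_K_M 4.2 GB
```

Under these assumptions a 7B model lands at roughly 4 GB with Q4_K_M, which matches the "General LLM" row.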
Pillar 4: Caching, Storage, and the User's Disk
Local models live on the user's disk. The user expects to know what's there, how big it is, and how to free space.
The pattern:
type ModelCacheEntry = {
id: string;
path: string;
sizeBytes: number;
downloadedAt: number;
lastUsedAt: number;
status: "available" | "downloading" | "verifying" | "failed";
};
class ModelCache {
private dir: string; // app.getPath('userData') + '/models'
async list(): Promise<ModelCacheEntry[]> {
return JSON.parse(await readFile(this.indexPath(), "utf8"));
}
async install(model: ModelEntry, onProgress: (pct: number) => void) {
const dest = path.join(this.dir, `${model.id}.${model.format}`);
await downloadResumable(model.url, dest, onProgress);
if (!await verifySha256(dest, model.sha256)) throw new Error("hash mismatch");
await this.upsertIndex({ id: model.id, path: dest, sizeBytes: model.sizeBytes, ... });
}
async uninstall(id: string) {
const entry = (await this.list()).find((e) => e.id === id);
if (!entry) return;
await rm(entry.path);
await this.removeFromIndex(id);
}
}
The renderer surfaces this:
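function ModelManager() {
  const models = useInstalledModels();
  // Deleting gigabytes should never be a single accidental click.
  function remove(m: ModelCacheEntry) {
    const gb = (m.sizeBytes / 1e9).toFixed(1);
    if (window.confirm(`Remove ${m.id} and free ${gb} GB?`)) {
      window.desktop.models.uninstall(m.id);
    }
  }
  return (
    <>
      <ul>
        {models.map((m) => (
          <li key={m.id}>
            <strong>{m.id}</strong> · {(m.sizeBytes / 1e9).toFixed(1)} GB
            <button onClick={() => remove(m)}>Remove</button>
          </li>
        ))}
      </ul>
      {/* A <ul> may only contain <li> children, so this button sits outside the list. */}
      <button onClick={() => window.desktop.models.openDirectory()}>Show in Finder</button>
    </>
  );
}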
Three rules that keep this maintainable.
Single canonical directory. All models live under app.getPath('userData')/models. One place to look, one place to back up, one place to clear.
Show the size, always. Users want to know "what's eating my disk?" Surface it. A 6GB hidden directory is a complaint waiting to happen.
Soft-delete with confirmation. A "remove this model" button that wipes 4GB without a confirm step is hostile.
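When the disk is nearly full, the cache index already holds enough to suggest what to remove. The sketch below is one possible policy (least recently used first), not something the app has to adopt; `CachedModel` and `suggestRemovals` are hypothetical names built on the `sizeBytes` and `lastUsedAt` fields of the cache entry.

```typescript
type CachedModel = { id: string; sizeBytes: number; lastUsedAt: number };

// Picks least-recently-used models until at least `bytesNeeded` would be freed.
// Returns suggestions only; deleting should still go through a user confirmation.
function suggestRemovals(models: CachedModel[], bytesNeeded: number): CachedModel[] {
  const oldestFirst = [...models].sort((a, b) => a.lastUsedAt - b.lastUsedAt);
  const picked: CachedModel[] = [];
  let freed = 0;
  for (const m of oldestFirst) {
    if (freed >= bytesNeeded) break;
    picked.push(m);
    freed += m.sizeBytes;
  }
  return picked;
}
```

The function never touches the disk, so the renderer can call it freely to pre-fill a "free up space" dialog.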
Pillar 5: Offline-First Sync
If your app has a cloud, the question is how local state and cloud state stay aligned. The rule that holds up: local is the source of truth, cloud is a backup.

A practical pattern using an outbox:
type Mutation = {
id: string;
kind: string;
payload: unknown;
createdAt: number;
};
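class Outbox {
  private flushScheduled = false;
  private flushing = false;
  constructor() {
    // Assumes this runs in the renderer, where `navigator` and the "online" event exist.
    window.addEventListener("online", () => this.scheduleFlush());
  }
  async enqueue(mut: Mutation) {
    await db.outbox.add(mut);
    this.scheduleFlush();
  }
  private async flush() {
    if (!navigator.onLine) return; // the "online" listener reschedules
    if (this.flushing) return this.scheduleFlush(); // a drain is already running
    this.flushing = true;
    try {
      // Oldest first, so the server sees mutations in the order the user made them.
      const pending = (await db.outbox.toArray()).sort((a, b) => a.createdAt - b.createdAt);
      for (const mut of pending) {
        try {
          await api.applyMutation(mut);
          await db.outbox.delete(mut.id);
        } catch (e) {
          // network or server error, leave in outbox, retry later
          this.scheduleFlush(30_000);
          break;
        }
      }
    } finally {
      this.flushing = false;
    }
  }
  private scheduleFlush(delayMs = 500) {
    if (this.flushScheduled) return;
    this.flushScheduled = true;
    setTimeout(() => { this.flushScheduled = false; void this.flush(); }, delayMs);
  }
}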
The renderer reads from the local store. The local store is mutated immediately on user action. The mutation gets enqueued to the outbox. When the network is up, the outbox drains.
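This is broadly the shape that sync engines such as Replicache, and the ones behind Linear, Notion, and Figma, are built around: apply the change locally first, then reconcile with the server. The "AI" parts of the app (transcripts, summaries, embeddings) follow the same pattern: generate locally, persist locally, optionally upload to the cloud for cross-device sync.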
For genuinely conflicting concurrent edits, CRDTs (Y.js, Automerge, Loro) handle the convergence. Most apps don't actually need CRDTs; "last writer wins, scoped per field" is enough.
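To make "last writer wins, scoped per field" concrete, here is a minimal sketch. It assumes every field carries a timestamp and the id of the device that wrote it; `Stamped`, `Doc`, `wins`, and `mergeLww` are hypothetical names. The device id is there only to break timestamp ties identically on every replica, and wall-clock skew between devices is ignored.

```typescript
// Each field carries its value, when it was written, and which device wrote it.
type Stamped<T> = { value: T; updatedAt: number; writer: string };
type Doc = Record<string, Stamped<unknown>>;

// True when `a` should replace `b`. Ties on time fall back to the writer id,
// so every replica resolves the same conflict the same way.
function wins(a: Stamped<unknown>, b: Stamped<unknown>): boolean {
  return a.updatedAt !== b.updatedAt ? a.updatedAt > b.updatedAt : a.writer > b.writer;
}

// Field-by-field merge: for every key, keep whichever side wrote it last.
function mergeLww(local: Doc, remote: Doc): Doc {
  const merged: Doc = { ...local };
  for (const [key, theirs] of Object.entries(remote)) {
    const ours = merged[key];
    if (!ours || wins(theirs, ours)) merged[key] = theirs;
  }
  return merged;
}
```

Because the merge is per field, a title edited on the laptop and a body edited on the phone both survive, which is the case a whole-document "last writer wins" would get wrong.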
Demo 1: Local Speech-to-Text
A "press to record, get a transcript" feature, running entirely on the user's machine via Whisper.
Two paths. Browser-first (transformers.js with whisper-tiny) for quick wins, Electron-native (whisper.cpp) for quality and speed.
// Browser path: transformers.js + WebGPU
// The `device: "webgpu"` option needs Transformers.js v3 (@huggingface/transformers).
import { useState } from "react";
import { pipeline } from "@huggingface/transformers";
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-small.en",
  { device: "webgpu" },
);
function VoiceNote() {
  const [recording, setRecording] = useState(false);
  const [text, setText] = useState("");
  async function record() {
    setRecording(true);
    try {
      const audio = await recordAudio(); // returns Float32Array at 16kHz
      const result = await transcriber(audio);
      setText(result.text);
    } finally {
      setRecording(false); // re-enable the button even if recording or transcription throws
    }
  }
  return (
    <>
      <button onClick={record} disabled={recording}>
        {recording ? "Listening..." : "Record"}
      </button>
      <p>{text}</p>
    </>
  );
}
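// Electron path: whisper.cpp via Node binding
// smart-whisper's transcribe() resolves to a task whose `result` promise yields timed
// segments; check this against the version you install.
import { ipcMain } from "electron";
import { Whisper } from "smart-whisper";
const whisper = new Whisper("/models/whisper-large-v3.bin", { gpu: true });
ipcMain.handle("voice:transcribe", async (_event, pcm: Float32Array) => {
  const task = await whisper.transcribe(pcm, { language: "en" });
  const segments = await task.result;
  return segments.map((s) => s.text).join(" ").trim();
});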
Which path you choose depends on the model size and the latency budget. Whisper-small (244M params) runs in WebGPU at a usable speed and doesn't have to ship in the app bundle: it is downloaded and cached on first use. Whisper-large-v3 (1.5B params) needs the native path on most consumer hardware.
For a private voice journal, a transcription feature in an email client, a meeting recorder that promises "we don't send your audio anywhere," local Whisper is the only credible architecture.
Demo 2: Offline AI Assistant with Llama via Ollama
A chat interface that talks to a local Llama through Ollama. Streaming, tool calling, offline-capable.
type ChatMessage = { role: "user" | "assistant" | "tool"; content: string };
async function* streamChat(messages: ChatMessage[]) {
const response = await fetch("http://localhost:11434/api/chat", {
method: "POST",
body: JSON.stringify({
model: "llama3.2:3b-instruct-q4_K_M",
messages,
stream: true,
}),
});
if (!response.ok || !response.body) throw new Error(`Ollama returned HTTP ${response.status}`);
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
let nl;
while ((nl = buffer.indexOf("\n")) >= 0) {
const line = buffer.slice(0, nl);
buffer = buffer.slice(nl + 1);
if (!line.trim()) continue;
const event = JSON.parse(line);
if (event.message?.content) yield event.message.content as string;
}
}
}
The React side coalesces streaming tokens at the display refresh rate (the same pattern from the streaming chapters):
function LocalChat() {
  const [history, setHistory] = useState<ChatMessage[]>([]);
  const [partial, setPartial] = useState("");
  async function send(prompt: string) {
    const next: ChatMessage = { role: "user", content: prompt };
    const newHistory = [...history, next];
    setHistory(newHistory);
    setPartial("");
    let buffer = "";
    let frame = 0; // id of the pending animation frame, 0 when none is queued
    for await (const token of streamChat(newHistory)) {
      buffer += token;
      if (!frame) {
        frame = requestAnimationFrame(() => {
          setPartial(buffer);
          frame = 0;
        });
      }
    }
    // Cancel a still-pending frame so it cannot re-fill `partial` after the reset below.
    if (frame) cancelAnimationFrame(frame);
    setHistory((h) => [...h, { role: "assistant", content: buffer }]);
    setPartial("");
  }
  return (
    <>
      <Messages messages={history} partial={partial} />
      <PromptInput onSubmit={send} />
    </>
  );
}
Bundling Ollama with your Electron app:
// Spawn Ollama as a child process at app start, kill on quit.
import path from "node:path";
import { spawn } from "node:child_process";
import { app } from "electron";
const binary = process.platform === "win32" ? "ollama.exe" : "ollama";
const ollamaPath = path.join(process.resourcesPath, "ollama", binary);
const ollamaProc = spawn(ollamaPath, ["serve"], {
  env: {
    ...process.env,
    OLLAMA_HOST: "127.0.0.1:11434",
    OLLAMA_MODELS: path.join(app.getPath("userData"), "models"),
  },
});
app.on("before-quit", () => ollamaProc.kill());
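A freshly spawned sidecar takes a moment before it accepts connections, so the first chat request can fail if it is sent too early. One way to handle that is to poll a cheap endpoint until the server answers. A sketch under assumptions: `waitForOllama` is a hypothetical helper, the timings are arbitrary, and it relies on Ollama's GET /api/version endpoint plus the global `fetch` and `AbortSignal.timeout` found in recent Node and Electron versions.

```typescript
// Polls the sidecar until it answers or the deadline passes.
// Resolves to true once the server responds, false if it never came up in time.
async function waitForOllama(
  baseUrl = "http://127.0.0.1:11434",
  timeoutMs = 15_000,
  intervalMs = 250,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`${baseUrl}/api/version`, { signal: AbortSignal.timeout(1_000) });
      if (res.ok) return true;
    } catch {
      // Not listening yet (connection refused) or too slow; fall through and retry.
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

Awaiting this once after `spawn` and before enabling the chat UI keeps the "server not ready" case out of the renderer.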
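A few caveats. Ollama, llama.cpp, and node-llama-cpp are all MIT-licensed, so bundling the runtime in a closed-source app is mainly a matter of shipping the licence notices. The model weights are licensed separately, so read the licence for every model you bundle. If you'd rather not manage a separate process (or a port that may already be taken by the user's own Ollama install), use llama.cpp directly via a Node binding and run it inside main rather than as a sidecar.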
Demo 3: Local Summarisation Workflow
A "summarise this page" button. The model lives locally, the document never leaves the machine.
The architecture:
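async function summarise(text: string): Promise<string> {
  const chunks = splitIntoChunks(text, 1500);
  // Map: one chunk at a time. A single local model gains little from parallel prompts
  // and may run out of memory trying to serve them.
  const partialSummaries: string[] = [];
  for (const chunk of chunks) {
    partialSummaries.push(await llm.complete(`Summarise concisely:\n\n${chunk}`));
  }
  // Reduce: merge the partial summaries in one final pass.
  const combined = partialSummaries.join("\n\n");
  return llm.complete(`Combine these partial summaries into one cohesive summary:\n\n${combined}`);
}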
The pattern (called "map-reduce summarisation") plays to a small model's strengths: each chunk fits comfortably in the context window, the final pass is cheap.
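The `splitIntoChunks` helper is not shown in the post. One possible version is sketched below; it assumes the size limit is counted in words (a rough stand-in for tokens) and prefers to break on blank lines between paragraphs.

```typescript
// Splits text into chunks of at most `maxWords` words, breaking on paragraph
// boundaries where possible. Words stand in for tokens here (an assumption).
function splitIntoChunks(text: string, maxWords: number): string[] {
  if (maxWords < 1) throw new RangeError("maxWords must be at least 1");
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current: string[] = [];
  let count = 0;

  // Emits the paragraphs collected so far as one chunk.
  const flush = () => {
    if (current.length > 0) chunks.push(current.join("\n\n"));
    current = [];
    count = 0;
  };

  for (const para of paragraphs) {
    const words = para.split(/\s+/);
    if (words.length > maxWords) {
      // A single oversized paragraph: emit it in fixed-size word windows.
      flush();
      for (let i = 0; i < words.length; i += maxWords) {
        chunks.push(words.slice(i, i + maxWords).join(" "));
      }
      continue;
    }
    if (count + words.length > maxWords) flush();
    current.push(para);
    count += words.length;
  }
  flush();
  return chunks;
}
```

A token-accurate splitter would use the model's own tokenizer; word counts are usually close enough to stay under the context window with some headroom.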
For longer documents that exceed even chunked summarisation, the pattern extends: embed each chunk, retrieve the most representative ones, summarise those. Same RAG pattern from the cloud world, just running on the user's hardware.
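"Most representative" can be defined in several ways. The sketch below takes one simple reading: rank chunks by cosine similarity to the mean of all chunk embeddings and keep the top k. It assumes the embeddings have already been computed by a local embedding model; `cosine` and `mostRepresentative` are hypothetical names.

```typescript
// Cosine similarity between two vectors of equal length (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Returns the indices of the k chunks closest to the document centroid,
// in original document order so the summary prompt still reads top to bottom.
function mostRepresentative(embeddings: number[][], k: number): number[] {
  if (embeddings.length === 0) return [];
  const dim = embeddings[0].length;
  const centroid = new Array<number>(dim).fill(0);
  for (const e of embeddings) {
    for (let i = 0; i < dim; i++) centroid[i] += e[i] / embeddings.length;
  }
  return embeddings
    .map((e, index) => ({ index, score: cosine(e, centroid) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.index)
    .sort((x, y) => x - y);
}
```

Centroid ranking favours the document's main theme and can drop minority topics; clustering first and taking one chunk per cluster is a common alternative when coverage matters more.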
function SummariseButton({ text }: { text: string }) {
const [summary, setSummary] = useState<string | null>(null);
const [running, setRunning] = useState(false);
async function run() {
setRunning(true);
try {
const result = await window.desktop.ai.summarise(text);
setSummary(result);
} finally {
setRunning(false);
}
}
return (
<>
<button onClick={run} disabled={running}>
{running ? "Summarising..." : "Summarise"}
</button>
{summary && <Card>{summary}</Card>}
</>
);
}
The user feels the latency: a 5000-word page might take 8-15 seconds on a mid-tier laptop. Stream the result so the user sees progress.
The Repo Layout
What a local-AI Electron + React app looks like on disk.
my-local-ai-app/
├── apps/
│   └── desktop/
│       ├── src/
│       │   ├── main/
│       │   │   ├── ai/
│       │   │   │   ├── llama.ts          # node-llama-cpp wrapper
│       │   │   │   ├── whisper.ts        # whisper.cpp wrapper
│       │   │   │   ├── ollama.ts         # sidecar lifecycle
│       │   │   │   └── model-cache.ts    # download, verify, install, uninstall
│       │   │   ├── ipc/
│       │   │   │   ├── ai.ipc.ts
│       │   │   │   └── models.ipc.ts
│       │   │   └── manifest/
│       │   │       └── models.json       # ModelManifest
│       │   ├── preload/
│       │   │   └── desktop.preload.ts
│       │   └── renderer/
│       │       ├── routes/
│       │       │   ├── chat.tsx
│       │       │   ├── voice-note.tsx
│       │       │   ├── summarise.tsx
│       │       │   └── settings/models.tsx
│       │       └── hooks/
│       │           ├── use-ai.ts
│       │           ├── use-models.ts
│       │           └── use-offline.ts
│       └── resources/
│           ├── ollama/                   # bundled binary (optional)
│           └── manifest.json
├── packages/
│   ├── ai-core/                          # IPC contract, model types, schemas
│   ├── ui/                               # shared components
│   └── sync/                             # outbox + cloud sync
└── turbo.json
Three things worth pulling out.
- ai-core is the shared package between main and renderer. Types only, no native imports. Both sides import the same ModelManifest, ChatMessage, etc.
- manifest.json is shipped as a resource and can be patched at runtime via auto-update. New models become available without a full app upgrade.
- sync is its own package because the offline-first patterns are reusable across AI and non-AI features.
Benchmarks: Local vs Cloud
Numbers from running the same prompts through equivalent quality models on a 2024 M3 MacBook Pro with 32GB unified memory. Cloud uses standard public APIs; local uses Ollama with Q4 quantisation. Indicative, not promissory.
| Metric | Cloud (GPT-class) | Local Llama 3.2 3B | Local Llama 3.1 8B | Local Whisper-small |
|---|---|---|---|---|
| First-token latency | 600-900 ms | 80-150 ms | 200-400 ms | n/a |
| Stream cadence | 80-120 tok/s | 60-90 tok/s | 25-40 tok/s | n/a |
| Cold start | n/a (warm) | 1-2 s (load weights) | 3-6 s | 1-2 s |
| Memory while loaded | 0 (server) | 2.5 GB | 5.5 GB | 1 GB |
| Disk used | 0 | 2 GB | 4.5 GB | 470 MB |
| Network per request | ~5 KB up, response down | 0 | 0 | 0 |
| Cost per million tokens | $1-15 | $0 | $0 | $0 |
| Privacy | Data sent | Data local | Data local | Data local |
| Offline | No | Yes | Yes | Yes |
| Quality on hard tasks | High | Medium | Medium-High | High |
The interesting rows are the bottom four. Local wins on cost, privacy, and offline use by definition. Quality is the tradeoff. For routing, classification, summarisation, and short Q&A, 3B-class models are excellent. For nuanced multi-turn reasoning, the cloud still leads.
The right architecture often blends both: local for everything that can be local, cloud as a fallback for the cases where it's worth the round-trip.
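One simple policy for that blend is "local first, cloud only on failure or timeout, and only with consent". The sketch below is an illustration of that policy rather than a recommendation; `Completer`, `withTimeout`, and `completeHybrid` are hypothetical names, and the two completers stand in for whatever local and cloud clients the app already has.

```typescript
type Completer = (prompt: string) => Promise<string>;

// Rejects if `work` has not settled within `ms` milliseconds.
// Note: this stops waiting, it does not cancel the underlying work.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

// Local first. The cloud is tried only if local fails AND the user has opted in.
async function completeHybrid(
  prompt: string,
  local: Completer,
  cloud: Completer,
  opts: { cloudAllowed: boolean; localTimeoutMs: number },
): Promise<{ text: string; source: "local" | "cloud" }> {
  try {
    return { text: await withTimeout(local(prompt), opts.localTimeoutMs), source: "local" };
  } catch (err) {
    if (!opts.cloudAllowed) throw err; // no consent: surface the local failure
    return { text: await cloud(prompt), source: "cloud" };
  }
}
```

Returning `source` alongside the text lets the UI show the user which path answered, which matters when privacy is part of the promise.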
Memory Optimisation
A local-AI app uses memory the way a normal React app doesn't.
Memory-map weights, don't read them. llama.cpp does this by default. The OS pages weights in on demand. Virtual memory looks high; resident memory is lower, and the OS can evict mapped pages under pressure because they are backed by the file.
Unload on idle. A 5GB model held forever bleeds memory the user could be using. Idle for 5 minutes, unload. Reload on next use.
One model loaded at a time. Unless you're explicitly running two workloads, don't keep multiple LLMs warm. The first inference after a swap costs an extra second; the alternative is OOM on a 16GB machine.
Quantise aggressively. Q4_K_M cuts the weight footprint to roughly a third of FP16 with minimal quality loss. Q8 is for when quality matters more than memory.
Streaming, not buffering. Hold the partial response in memory at the token level. Don't keep accumulating gigabytes of chat history in JS arrays. Persist to disk, page in on render.
Profile with process.memoryUsage(). The renderer is Chromium, but main is Node, with different memory characteristics.
setInterval(() => {
const u = process.memoryUsage();
telemetry.emit({
kind: "metric",
source: "main.memory",
name: "rss_mb",
value: u.rss / 1024 / 1024,
ts: Date.now(),
});
}, 30_000);
Watch the RSS over a long session. If it grows without bound, find the leak (usually a model loaded in a long-lived path that should be unloaded).
GPU vs CPU
Local-AI runtimes accept a gpu flag. The defaults are getting good, but a few rules.
On Apple Silicon, always GPU. Metal acceleration in llama.cpp is mature. The CPU path is for benchmarking, not production.
On NVIDIA, CUDA if available. Compile-time flag. Worth the build complexity. 3-5x faster than CPU on most workloads.
On AMD, ROCm is improving but inconsistent. Test thoroughly. Many users on AMD will fall back to CPU. Plan for that.
On Intel, OpenVINO is the path. Less common in the Node ecosystem, but supported.
On NPUs (Snapdragon, Apple Neural Engine), the runtime has to opt in. ONNX Runtime supports several; llama.cpp is catching up. Bigger deal in 2027 than 2026.
A practical rule: feature-detect at startup and tell the user what's running.
const caps = await detectGpuCapabilities();
telemetry.emit({ kind: "log", source: "ai", level: "info", ts: Date.now(),
message: "ai runtime", data: caps });
function AIStatusBar() {
const caps = useGpuCaps();
return <small>Running on {caps.label} ({caps.tokensPerSec.toFixed(0)} tok/s)</small>;
}
Users like seeing the model is using their GPU. It builds trust that the app is doing what it claims to.
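For the browser path, the detection step can be as small as checking for WebGPU. The sketch below covers only that case and uses the standard `navigator.gpu.requestAdapter()` call; `WebGpuCaps` and `detectWebGpu` are hypothetical names, and native runtimes in main would need their own probing.

```typescript
type WebGpuCaps = { backend: "webgpu" | "wasm"; label: string };

const CPU_FALLBACK: WebGpuCaps = { backend: "wasm", label: "CPU (WebAssembly)" };

// Feature-detects WebGPU. `navigator.gpu` is missing in browsers without WebGPU
// (and in Node), in which case the WebAssembly/CPU path is reported.
async function detectWebGpu(): Promise<WebGpuCaps> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return CPU_FALLBACK;
  try {
    // requestAdapter() resolves to null when no suitable GPU is available.
    const adapter = await gpu.requestAdapter();
    return adapter ? { backend: "webgpu", label: "GPU (WebGPU)" } : CPU_FALLBACK;
  } catch {
    return CPU_FALLBACK;
  }
}
```

The result can pick the `device` passed to the inference library and feed the status line shown to the user.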
Production Tradeoffs
Six honest costs.
Disk usage. A 7B model is 4GB. Two models is 8. Three is 12. Users notice. Surface this clearly.
First-install size. Bundling a model adds gigabytes to your installer. Most apps download on first use. That makes the first launch slow.
Quality variability. Cloud models are state of the art. Local models lag by 6 to 18 months on hard tasks. For nuanced work, cloud still wins.
Hardware variability. A user on a 2018 MacBook Air sees different latency than a user on a 2024 M3 Pro. Plan for the worst-case envelope.
Licensing. Some open-weight models have restrictive licences (research-only, no commercial use, no derivative training). Read the licence per model. Don't assume "open-weight" means "free for any use."
Update path. Pushing a new model version is a 4GB download. Patch model files in place when possible (delta downloads via diff formats). Otherwise, schedule updates for off-hours.
Should You Build Local-First AI?
The shape fits when:
- Privacy is the brand. Legal, medical, financial, anything HIPAA / GDPR adjacent.
- The user is offline often. Pilots, sailors, field workers, anyone without reliable network.
- Cost is a constraint. $0.01 per cloud LLM call adds up. Local is free per call.
- Latency is critical. Sub-200ms first-token latency is hard with a cloud round trip. Local is feasible.
- The model's quality is sufficient. If 3B-class quality is good enough for your task, why pay for 70B?
The shape doesn't fit when:
- You need state-of-the-art quality. Cloud is ahead by a generation.
- Your user base is on phones. Local LLM on mobile is improving but limited.
- You need cross-device continuity. A model running on the user's MacBook isn't reachable from their phone. Cloud APIs are.
- Your team is small. Building model lifecycle, caching, sync, offline tooling is a real engineering investment.
For a meaningful slice of apps, local-first AI is the right default. For most, hybrid (local for fast/private/offline-capable work, cloud for the rest) is the answer. For some, cloud is still the right call.
Wrapping Up
Local AI in 2026 is not a thought experiment. The hardware is there, the runtimes are there, the models are there, and the architectural patterns are settling.
Three rules that hold up across every local-AI app I've worked on:
- Local is the source of truth, cloud is a backup. The app works without the network. Sync is an enhancement.
- Model lifecycle is its own subsystem. Manifest, download, verify, cache, load, unload. Treat it like a database, not like a static asset.
- Surface what's running, what's loaded, what's using memory. Users trust apps they can see into.
Get those right and you ship an app that runs AI without owing anyone a GPU bill. That used to be a moonshot. Now it's a Tuesday.