Building AI Copilots Inside React Apps

Contextual sidekicks, memory, streaming, and tool calling. The architecture behind an AI feature that actually understands what your user is doing.

Senior frontend engineer with 5+ years building React systems that go beyond CRUD: real-time visualisation at 1kHz, Electron orchestration around native SDKs, WebGPU rendering, and AI-aware design systems. Conference speaker on React performance and prompt-driven systems. Author of the React Beyond UI series — 16 deep-dive posts on frontend systems engineering.

A chatbot floats off to the side and answers questions. A copilot sits next to you and helps you finish the task.

The difference is contextual awareness. A chatbot has no idea which page you're on, which row you've selected, or what you typed three minutes ago. A copilot does. That changes everything about how it's built, what it can do, and how careful you need to be about the data it sees.

Tools like Cursor, GitHub Copilot Chat, Notion AI, Linear's assistant, Claude Code, and Raycast AI all share this shape: an LLM that lives inside the app, knows what's happening, and helps the user without making them re-explain context every turn.

This post is the systems-engineering view of how that pattern actually gets built into a React app. The architecture, the memory, the tool calls, the security boundary, the latency tricks. Everything you need to ship something that isn't a glorified chatbot.

What's a Copilot, Specifically?

Three properties separate a copilot from a chatbot.

Contextual awareness. The copilot knows what the user is looking at: the current route, the selected item, the open document, the visible filters. It doesn't have to be told. It listens.

Persistent memory. Conversation history isn't a list of messages flushed on reload. It's a tiered memory system: working memory for the current task, session memory across the session, long-term memory across sessions (when the user opts in).

Actionable tools. The copilot can do things, not just talk about them. Open a record, apply a filter, draft an email, generate code. Anything the user could do, the copilot can do, scoped by permissions.

A chatbot is a stateless API call wrapped in a chat UI. A copilot is a stateful system that integrates with your app. The engineering difference is significant.

Who's Already Doing This?

A few production copilots worth studying:

  • Cursor and Windsurf: chat sidebar with whole-codebase context, file edits via tool calls, persistent project memory.
  • GitHub Copilot Chat: file-aware chat in VS Code, JetBrains, and the browser.
  • Claude Code: terminal-based copilot for software engineering with a tool-calling loop over your filesystem.
  • Notion AI: inline "ask AI" plus a sidebar that knows your workspace.
  • Linear's AI: knows the issue you're on, can summarise threads, draft comments, link related work.
  • Raycast AI: macOS launcher with AI commands that operate on selected text or open apps.
  • Vercel v0 chat: knows the page you're building, can iterate on existing components.
  • Klarna and Intercom Fin: customer-support copilots embedded in real product flows.

Different domains, same skeleton. The skeleton is what we're building.

The Anatomy of a Copilot

Strip the framework jargon away and a copilot is six pieces, each independently swappable.

The Anatomy of a Copilot diagram

The arrows go both ways for a reason. The UI feeds the model (user input). The model feeds the UI (streamed tokens). Tools dispatched by the model produce side effects that the UI reflects. Memory is read on every turn and written on every turn. Context is captured continuously, not just at submit time.

If your copilot is missing any of these six, you have a chatbot with a bigger system prompt.

Think of It Like a Pair Programmer

A new engineer joins your team. They sit next to you. They can see your screen, hear what you're saying, read the ticket, see the failing test. You don't have to re-explain every five minutes.

  • The screen they see is the context.
  • The conversation you've been having is the working memory.
  • What they remember from yesterday is the session memory.
  • What they know about the codebase is the long-term memory.
  • What they can do (run tests, push commits) is the tool surface.

A great pair programmer is most of those things, mostly without being asked. So is a great copilot. The engineering question is which of those properties you bake in deliberately versus hope the model figures out from a system prompt.

Pillar 1: Where the Copilot Lives in the React Tree

Three patterns, each with different UX implications.

Sidebar. The copilot lives in a persistent panel on the right. Always available, doesn't interrupt. Cursor, Linear, VS Code Copilot Chat all live here.

function AppShell() {
  const [copilotOpen, setCopilotOpen] = useState(true);
  return (
    <div className="grid grid-cols-[1fr,auto] h-screen">
      <MainContent />
      {copilotOpen && (
        <aside className="w-96 border-l">
          <CopilotPanel />
        </aside>
      )}
    </div>
  );
}

Modal or command palette. Triggered by keypress, focused on a single task, dismissed when done. Raycast, GitHub's ? palette, ChatGPT's quick-action menu. Good for "one-shot" interactions.

function CopilotCommandPalette() {
  const [open, setOpen] = useState(false);
  useHotkeys("mod+k", () => setOpen(true));
  return (
    <Dialog open={open} onClose={() => setOpen(false)}>
      <CopilotPanel autofocus />
    </Dialog>
  );
}

Inline. The copilot lives next to the thing it's helping with. A "draft with AI" button next to a text field, an "explain this" tooltip on a chart. Notion's slash menu, Linear's ++ to AI-complete an issue title.

function IssueTitleInput() {
  const [title, setTitle] = useState("");
  const [aiOpen, setAiOpen] = useState(false);
  return (
    <div className="flex gap-2">
      <input value={title} onChange={(e) => setTitle(e.target.value)} />
      <button onClick={() => setAiOpen(true)}>Draft with AI</button>
      {aiOpen && (
        <InlineCopilot
          initialPrompt={`Draft an issue title for: ${title}`}
          onAccept={setTitle}
          onClose={() => setAiOpen(false)}
        />
      )}
    </div>
  );
}

Most real apps end up using all three. Sidebar for ongoing assistance, command palette for "do a thing now," inline for help with a specific field.

The architectural question isn't which one to pick. It's how to share the underlying copilot state across all three so the user doesn't lose context jumping between them.
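One way to share that state is a small store that lives outside the React tree, which all three surfaces subscribe to. The sketch below is an assumption about how you might structure it, not a library API: `createCopilotStore`, `CopilotState`, and `copilotStore` are illustrative names.

```typescript
// Hypothetical shared store; every name here is illustrative.
type ChatMessage = { role: "user" | "assistant"; content: string };
type CopilotState = { messages: ChatMessage[]; draft: string };
type Listener = () => void;

export function createCopilotStore(
  initial: CopilotState = { messages: [], draft: "" },
) {
  let state = initial;
  const listeners = new Set<Listener>();

  return {
    // subscribe/getSnapshot match the shape React's useSyncExternalStore expects.
    subscribe(listener: Listener): () => void {
      listeners.add(listener);
      return () => {
        listeners.delete(listener);
      };
    },
    getSnapshot(): CopilotState {
      return state;
    },
    // Replace the state object on every update so subscribers see a new reference.
    update(patch: Partial<CopilotState>): void {
      state = { ...state, ...patch };
      listeners.forEach((notify) => notify());
    },
  };
}

// One instance per app, imported by the sidebar, the palette, and inline copilots.
export const copilotStore = createCopilotStore();
```

Each surface can then read it with `useSyncExternalStore(copilotStore.subscribe, copilotStore.getSnapshot)`, so a conversation started in the palette is still there when the sidebar opens.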

Pillar 2: Contextual Awareness

A copilot that doesn't know what you're doing is just a chatbot. Context is captured continuously and passed to the model on every turn.

A simple React provider that listens to the app state and exposes it to the copilot:

type CopilotContext = {
  route: string;
  selectedId: string | null;
  visibleData: unknown;
  recentActions: Action[];
};

const CopilotCtx = createContext<CopilotContext | null>(null);

export function CopilotProvider({ children }: { children: React.ReactNode }) {
  const route = useRoute();
  const selectedId = useSelectedId();
  const visibleData = useVisibleData();
  const recentActions = useRecentActions(10);

  const value = useMemo(
    () => ({ route, selectedId, visibleData, recentActions }),
    [route, selectedId, visibleData, recentActions],
  );

  return <CopilotCtx.Provider value={value}>{children}</CopilotCtx.Provider>;
}

export function useCopilotContext() {
  const ctx = useContext(CopilotCtx);
  if (!ctx) throw new Error("useCopilotContext must be used inside CopilotProvider");
  return ctx;
}

The context is read inside the copilot before each turn and serialised into the model's system prompt:

function buildSystemPrompt(ctx: CopilotContext): string {
  return `You are an assistant inside an app.

Current route: ${ctx.route}
Selected item: ${ctx.selectedId ?? "none"}
Visible data: ${(JSON.stringify(ctx.visibleData) ?? "null").slice(0, 2000)}
Recent user actions:
${ctx.recentActions.map((a) => `- ${a.kind} at ${a.ts}`).join("\n")}

Help the user with their current task. Use the available tools when appropriate.`;
}

Three rules for context.

Snapshot, don't subscribe. Capture the context at the moment the user sends a message. If the model takes 4 seconds to respond and the user navigates in the meantime, the model should still answer about what they had open when they asked.

Cap the payload. A 50,000-row table is not context. The first 50 rows are. Apply sensible limits. Models don't need everything, they need enough.

Strip private data at the boundary. PII, tokens, internal IDs you don't want logged. Filter them out before serialising. More on this in the security section.
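The first two rules can be combined into one helper that runs when the user hits send. This is a minimal sketch under assumptions: `snapshotContext` and `ContextSnapshot` are hypothetical names, and `rows` stands in for whatever "visible data" means in your app.

```typescript
// Illustrative types; adapt the fields to your own context shape.
type ContextSnapshot = {
  route: string;
  selectedId: string | null;
  rows: unknown[];
  totalRows: number;
  capturedAt: number;
};

export function snapshotContext(
  live: { route: string; selectedId: string | null; rows: unknown[] },
  maxRows = 50,
): ContextSnapshot {
  // Cap first, then deep-copy via JSON so later changes to app state
  // cannot leak into a request that is already in flight.
  const capped = live.rows.slice(0, maxRows);
  return {
    route: live.route,
    selectedId: live.selectedId,
    rows: JSON.parse(JSON.stringify(capped)) as unknown[],
    // Keep the real total so the prompt can say "showing 50 of N".
    totalRows: live.rows.length,
    capturedAt: Date.now(),
  };
}
```

Capping by row count rather than slicing the serialised string keeps the JSON well-formed. The JSON round-trip drops functions and `undefined` values, which is usually what you want for prompt context anyway.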

Pillar 3: Conversation Memory

Memory has three tiers, each with different lifetimes and retrieval rules.

Pillar 3: Conversation Memory diagram

Working memory is the messages for the current task. Just the conversation as it stands. Lives in React state, flushed when you reset.

Session memory is the conversation across the current visit. Lives in the browser, usually in sessionStorage (IndexedDB also works if you clear it yourself when the visit ends). Survives a route change and a reload, but not closing the tab.

Long-term memory is what the model knows about the user across sessions. The user's preferences ("I prefer concise summaries"), facts ("my project is called X"), past decisions ("last week we decided to use Tailwind"). Lives in a server, indexed by user ID.

A practical store that handles all three tiers:

type Message = { role: "user" | "assistant" | "tool"; content: string; ts: number };

export class MemoryStore {
  private working: Message[] = [];
  private session: Message[] = [];
  private longTermFetcher: (userId: string, query: string) => Promise<string[]>;

  constructor(longTermFetcher: typeof this.longTermFetcher) {
    this.longTermFetcher = longTermFetcher;
    const persisted = sessionStorage.getItem("copilot-session");
    if (persisted) this.session = JSON.parse(persisted);
  }

  append(msg: Message) {
    this.working.push(msg);
    this.session.push(msg);
    sessionStorage.setItem("copilot-session", JSON.stringify(this.session));
  }

  reset() {
    this.working = [];
  }

  async assemble(userId: string, query: string): Promise<Message[]> {
    const longTerm = await this.longTermFetcher(userId, query);
    const longTermMsg: Message = {
      role: "user",
      content: `Background you should know:\n${longTerm.join("\n")}`,
      ts: Date.now(),
    };
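    // append() writes every message to both stores, so spreading the session here
    // as well would send each message twice. The session is kept for later retrieval.
    return [longTermMsg, ...this.working.slice(-20)];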
  }
}

Two things to internalise.

Working memory shrinks; session memory doesn't. The model context window is finite. The working memory keeps the last ~20 messages, the session keeps everything for later retrieval, the long-term gets summarised periodically.

Long-term memory is retrieved, not loaded. You don't pour every fact you've ever stored about a user into the prompt. You retrieve the relevant ones for the current query, the same way you'd do RAG over documents.

The "compress conversation history when it gets too long" pattern is standard. When the working memory exceeds a token budget, summarise the oldest half into a paragraph and replace it:

async function compressIfNeeded(messages: Message[], maxTokens: number) {
  if (estimateTokens(messages) < maxTokens) return messages;
  const half = Math.floor(messages.length / 2);
  const old = messages.slice(0, half);
  const recent = messages.slice(half);
  const summary = await summariseConversation(old);
  return [
    { role: "user", content: `Earlier in this conversation: ${summary}`, ts: Date.now() },
    ...recent,
  ];
}
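`compressIfNeeded` relies on an `estimateTokens` helper that isn't defined above. A rough stand-in is sketched below, assuming about four characters per token for English text plus a small per-message overhead. Both constants are assumptions; your provider's tokenizer or token-counting endpoint is the place to get exact numbers.

```typescript
type Msg = { role: string; content: string };

// Rule-of-thumb constants, not exact values. Tune them per model and language.
const CHARS_PER_TOKEN = 4;
const PER_MESSAGE_OVERHEAD = 4;

export function estimateTokens(messages: Msg[]): number {
  return messages.reduce(
    (sum, m) =>
      sum + Math.ceil(m.content.length / CHARS_PER_TOKEN) + PER_MESSAGE_OVERHEAD,
    0,
  );
}
```

A heuristic like this is cheap enough to run on every turn. Leave headroom in `maxTokens`, because it will undercount code and non-English text.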

Pillar 4: Streaming Responses

Streaming is the one thing that changes whether your copilot feels intelligent or sluggish. A user staring at a spinner for 4 seconds and getting a wall of text is different from a user watching the answer type itself out token by token.

The streaming primitive in 2026 is fetch with a streamed body, or the official SDK's stream helper:

function CopilotMessages() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [partial, setPartial] = useState<string>("");

  async function send(prompt: string) {
    const newMessages = [...messages, { role: "user" as const, content: prompt, ts: Date.now() }];
    setMessages(newMessages);
    setPartial("");

    const stream = await client.messages.stream({
      model: "claude-sonnet-4-6",
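      // Send only role and content; ts is local bookkeeping the API doesn't expect.
      messages: newMessages.map(({ role, content }) => ({ role, content })),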
      system: buildSystemPrompt(context),
      tools: toolDefinitions,
    });

    let buffer = "";
    for await (const event of stream) {
      if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
        buffer += event.delta.text;
        // Coalesce per frame to avoid 100 setState/s
        scheduleFlush(setPartial)(buffer);
      }
    }

    setMessages((prev) => [...prev, { role: "assistant", content: buffer, ts: Date.now() }]);
    setPartial("");
  }

  return (
    <div>
      {messages.map((m, i) => <Bubble key={i} message={m} />)}
      {partial && <Bubble message={{ role: "assistant", content: partial, ts: Date.now() }} streaming />}
    </div>
  );
}

Three practical traps when streaming into React.

Don't call setState per token. Tokens can arrive faster than the display refreshes. Coalesce updates with requestAnimationFrame so you apply at most one state update per frame (60 a second on a 60 Hz display).

let pending = "";
let scheduled = false;

function scheduleFlush(apply: (value: string) => void) {
  return (next: string) => {
    pending = next;
    if (scheduled) return;
    scheduled = true;
    requestAnimationFrame(() => {
      apply(pending);
      scheduled = false;
    });
  };
}

Don't re-parse markdown on every token. Re-parsing a growing string is O(n²). Memoise the markdown component, parse the full string once per frame, not per token.

const StreamedMarkdown = memo(function StreamedMarkdown({ text }: { text: string }) {
  return <ReactMarkdown>{text}</ReactMarkdown>;
});

Auto-scroll, but only if the user is at the bottom. A user reading the top of a long response doesn't want the view yanked back down every time a token arrives.

useEffect(() => {
  if (!scrollRef.current) return;
  const isAtBottom = scrollRef.current.scrollHeight - scrollRef.current.scrollTop < scrollRef.current.clientHeight + 100;
  if (isAtBottom) scrollRef.current.scrollTop = scrollRef.current.scrollHeight;
}, [partial]);

Get all three right and your copilot feels alive. Get any of them wrong and it feels broken in a way users can't quite articulate.

Pillar 5: Tool Calling Inside the App

Tools are how a copilot does things, not just talks about them. In a React app, "doing things" means changing app state, navigating, calling APIs, modifying documents.

The tool dispatch flow:

Pillar 5: Tool Calling Inside the App diagram

A tool registry inside the React app:

import { z } from "zod";

const FilterIssues = {
  name: "filter_issues",
  description: "Filter the visible issues by a column and value.",
  schema: z.object({
    column: z.enum(["status", "assignee", "label"]),
    value: z.string(),
  }),
  destructive: false,
  handler: async (args: { column: "status" | "assignee" | "label"; value: string }) => {
    issuesStore.setFilter(args.column, args.value);
    return { ok: true };
  },
};

const CreateIssue = {
  name: "create_issue",
  description: "Create a new issue with a title and optional body.",
  schema: z.object({
    title: z.string().min(1),
    body: z.string().optional(),
  }),
  destructive: true,
  handler: async (args: { title: string; body?: string }) => {
    const issue = await api.issues.create(args);
    navigate(`/issues/${issue.id}`);
    return { ok: true, id: issue.id };
  },
};

const registry = { filter_issues: FilterIssues, create_issue: CreateIssue } as const;

Dispatcher with validation, confirmation gating for destructive tools, and an audit log:

async function dispatch(name: string, rawArgs: unknown, ui: CopilotUI) {
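  // Check own keys only, so a name like "toString" can't resolve to a prototype member.
  if (!Object.hasOwn(registry, name)) return { ok: false, error: `unknown tool: ${name}` };
  const tool = registry[name as keyof typeof registry];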

  const parsed = tool.schema.safeParse(rawArgs);
  if (!parsed.success) {
    return { ok: false, error: parsed.error.message };
  }

  if (tool.destructive) {
    const approved = await ui.askConfirmation({
      title: `Run ${name}?`,
      args: parsed.data,
    });
    if (!approved) return { ok: false, error: "user declined" };
  }

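  const started = performance.now();
  const result = await tool.handler(parsed.data as never);
  // Log after the handler runs so the entry can carry the result and latency.
  audit.log({ name, args: parsed.data, result, latencyMs: performance.now() - started, ts: Date.now() });
  return result;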
}

Three things to make non-negotiable.

Schema-validate every tool argument. The model will, occasionally, send strings where you wanted numbers. Reject confidently.

Gate destructive tools behind a user confirm. "Delete record", "send message", "modify settings". The model proposes. The user approves. Always.

Audit-log every dispatch. Name, args, result, latency. When something goes weird in production (and it will), you want a paper trail.

Pillar 6: RAG for App State and Docs

When the user's context exceeds what fits in the model's window, you need retrieval. A copilot inside a project-management app has thousands of issues, hundreds of pages of docs, dozens of past decisions. You can't paste all of it into the prompt.

Pillar 6: RAG for App State and Docs diagram

Three things you'll want to index:

  • Static docs (product docs, runbooks, FAQ entries).
  • App data (issues, customers, orders) updated as records change.
  • Conversation history and user facts (long-term memory).

A practical indexing pattern using pgvector on Postgres (cheap, scales fine for most apps):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
  id          uuid PRIMARY KEY,
  source_id   text NOT NULL,
  source_type text NOT NULL,
  content     text NOT NULL,
  embedding   vector(1536) NOT NULL,
  updated_at  timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX ON doc_chunks USING ivfflat (embedding vector_cosine_ops);

Retrieval is then a nearest-neighbour query, ordered by cosine distance (the <=> operator):

async function retrieve(query: string, k = 6): Promise<DocChunk[]> {
  const embedding = await embed(query);
  const rows = await sql`
    SELECT id, source_id, source_type, content
    FROM doc_chunks
    ORDER BY embedding <=> ${pgvector(embedding)}
    LIMIT ${k}
  `;
  return rows;
}

Pull the top-k results into the system prompt as context:

async function buildPrompt(userMsg: string, ctx: CopilotContext) {
  const docs = await retrieve(userMsg, 6);
  return [
    {
      role: "system",
      content: `You are an assistant inside ${ctx.appName}.

Relevant context:
${docs.map((d, i) => `[${i + 1}] (${d.source_type} ${d.source_id})\n${d.content}`).join("\n\n")}`,
    },
    ...history,
    { role: "user", content: userMsg },
  ];
}

Two caveats.

Re-embed on edit. If a doc changes, its embedding is stale. Re-embed in the background after a write. Track updated_at so you can sweep older entries.

Cache aggressively. Embedding is the slow part. Cache prompt embeddings keyed by exact text. Repeat queries are common in chat.
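A cache keyed by exact text can be a thin wrapper around whatever embedding function you already have. The sketch below is one possible shape: `withEmbeddingCache` is a hypothetical name, and the 500-entry default is an arbitrary assumption.

```typescript
type Embedder = (text: string) => Promise<number[]>;

// Wraps any embedder with a small LRU cache keyed by exact text.
// A Map iterates in insertion order, so its first key is the least recently used.
export function withEmbeddingCache(embed: Embedder, maxEntries = 500): Embedder {
  const cache = new Map<string, Promise<number[]>>();

  return (text) => {
    const hit = cache.get(text);
    if (hit) {
      // Re-insert to mark this key as most recently used.
      cache.delete(text);
      cache.set(text, hit);
      return hit;
    }

    // Cache the promise itself so concurrent identical queries share one request.
    const pending = embed(text);
    cache.set(text, pending);
    // Don't keep failed requests around; the caller still sees the rejection.
    pending.catch(() => {
      if (cache.get(text) === pending) cache.delete(text);
    });

    if (cache.size > maxEntries) {
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }
    return pending;
  };
}
```

Usage would look like `const embedCached = withEmbeddingCache(embed)`, with `retrieve` calling `embedCached(query)` instead of `embed(query)`.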

Demo: AI Dashboard Assistant

Concrete. A dashboard with a copilot sidebar. The user asks "Why are sign-ups down this week?" and the copilot:

  1. Reads the visible chart (context).
  2. Retrieves last week's launch notes and incident reports (RAG).
  3. Streams a narrative answer.
  4. Calls filter_timeline to highlight the deployment that landed on Tuesday (tool).

The shape, end to end:

function DashboardPage() {
  const data = useDashboardData();
  return (
    <CopilotProvider>
      <DashboardLayout
        main={<DashboardCharts data={data} />}
        sidebar={<CopilotSidebar />}
      />
    </CopilotProvider>
  );
}

function CopilotSidebar() {
  const ctx = useCopilotContext();
  const { messages, partial, send } = useCopilot({
    systemPromptBuilder: (ctx) => buildSystemPrompt(ctx),
    tools: registry,
  });
  return (
    <div className="flex flex-col h-full">
      <Messages messages={messages} partial={partial} />
      <PromptInput onSubmit={send} />
    </div>
  );
}

The useCopilot hook holds the streaming + dispatch loop, the memory store, and the tool runner. The page composition stays clean.

Demo: Code Generation Panel

A different shape. A panel where the user types a description and gets a working component scaffold dropped into the editor. Cursor and v0 use this pattern.

function CodePanel() {
  const [code, setCode] = useState<string>("");
  const editor = useEditor();
  const [streaming, setStreaming] = useState(false);

  async function generate(prompt: string) {
    setStreaming(true);
    const stream = await client.messages.stream({
      model: "claude-sonnet-4-6",
      system: "Output only valid React + TypeScript code. No commentary.",
      messages: [{ role: "user", content: prompt }],
    });
    let buffer = "";
    for await (const event of stream) {
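      // Only text deltas carry .text; other delta types (e.g. tool input JSON) do not.
      if (event.type === "content_block_delta" && event.delta.type === "text_delta") {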
        buffer += event.delta.text;
        scheduleFlush(setCode)(buffer);
      }
    }
    setStreaming(false);
  }

  return (
    <div>
      <PromptInput onSubmit={generate} disabled={streaming} />
      <CodeBlock code={code} />
      <button disabled={!code} onClick={() => editor.insertAtCursor(code)}>
        Insert into editor
      </button>
    </div>
  );
}

Streaming directly into a <CodeBlock> (a syntax-highlighted preview) gives the user a sense of progress and lets them cancel early if the output is going the wrong way.

Demo: Contextual Form Assistant

The inline pattern. A "help me write this" button next to a text area that opens a small copilot specifically for that field, with context about the form and what the field is for.

function IssueForm() {
  const [title, setTitle] = useState("");
  const [body, setBody] = useState("");

  return (
    <form>
      <FieldWithAssistant
        label="Title"
        value={title}
        onChange={setTitle}
        assistantContext="A clear, action-oriented issue title under 80 characters."
      />
      <FieldWithAssistant
        label="Body"
        value={body}
        onChange={setBody}
        assistantContext={`Body of a software engineering issue. Reference the title: "${title}".`}
      />
      <button type="submit">Create</button>
    </form>
  );
}

function FieldWithAssistant({ label, value, onChange, assistantContext }: Props) {
  const [open, setOpen] = useState(false);
  return (
    <div>
      <label>{label}</label>
      <textarea value={value} onChange={(e) => onChange(e.target.value)} />
      <button type="button" onClick={() => setOpen(true)}>Draft with AI</button>
      {open && (
        <InlinePopover>
          <InlineCopilot
            seed={value}
            context={assistantContext}
            onAccept={(text) => { onChange(text); setOpen(false); }}
            onClose={() => setOpen(false)}
          />
        </InlinePopover>
      )}
    </div>
  );
}

The trick that makes this feel right: the inline copilot has its own context (just this field), not the whole conversation. Less is more when the surface is small.

The Repo Layout

What the source tree looks like for a copilot-enabled React app. Reusable across the three demos above.

my-copilot-app/
├── packages/
│   ├── copilot-core/
│   │   ├── src/
│   │   │   ├── client.ts          # LLM SDK wrapper with streaming
│   │   │   ├── memory.ts          # MemoryStore (working, session, long-term)
│   │   │   ├── context.ts         # CopilotProvider + useCopilotContext
│   │   │   ├── tools.ts           # Tool registry + dispatcher
│   │   │   ├── retrieval.ts       # RAG helpers
│   │   │   └── use-copilot.ts     # the main hook
│   │   └── package.json
│   ├── copilot-ui/
│   │   ├── src/
│   │   │   ├── CopilotSidebar.tsx
│   │   │   ├── CopilotPalette.tsx
│   │   │   ├── InlineCopilot.tsx
│   │   │   ├── Messages.tsx
│   │   │   ├── StreamedMarkdown.tsx
│   │   │   └── PromptInput.tsx
│   │   └── package.json
│   └── copilot-server/
│       ├── src/
│       │   ├── api/
│       │   │   ├── stream.ts      # POST /api/copilot/stream
│       │   │   ├── memory.ts      # GET/PUT /api/copilot/memory/:userId
│       │   │   └── retrieve.ts    # POST /api/copilot/retrieve
│       │   ├── audit.ts
│       │   └── auth.ts
│       └── package.json
├── apps/
│   └── web/
│       ├── src/
│       │   ├── pages/
│       │   │   ├── dashboard.tsx  # Demo: dashboard assistant
│       │   │   ├── editor.tsx     # Demo: code generation
│       │   │   └── issues/new.tsx # Demo: contextual form
│       │   └── tools/             # app-specific tools
│       └── package.json
└── turbo.json

The split between copilot-core (logic), copilot-ui (components), and copilot-server (API + RAG) means you can rewire the LLM provider without touching the React layer, or swap the UI without rewriting the orchestration.

Latency Optimisation

A copilot that takes 6 seconds to start typing feels broken. The targets I aim for:

  • First token under 800ms. Anything more and the user starts to wonder.
  • Stream cadence above 30 tokens per second. Faster than the user reads.
  • Tool dispatch under 200ms (excluding the tool's own work).
  • Total response under 8 seconds for a typical question.

Five tricks that get there.

Stream from the edge. Run the streaming endpoint on edge functions (Cloudflare Workers, Vercel Edge) close to the user. First-byte latency drops dramatically.

Parallel tool calls. Modern models can request multiple tools in a single turn. Run them with Promise.all, not in a loop:

const results = await Promise.all(
  toolCalls.map((call) => dispatch(call.name, call.args, ui)),
);

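Speculative streaming. Open the response stream to the client while retrieval is still in flight, and give retrieval a short deadline. If it comes back in time, fold the results into the prompt before calling the model; if not, call the model without them. Riskier, because some answers go out without the retrieved docs, but it feels instant when it works.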
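The "fast enough" part comes down to racing retrieval against a timer. A minimal sketch, assuming a hypothetical `withDeadline` helper and treating a retrieval failure the same as a timeout:

```typescript
// Resolves with the promise's value if it settles within `ms`, otherwise with `fallback`.
export function withDeadline<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        // A failed retrieval is treated as "no extra context", not as a fatal error.
        clearTimeout(timer);
        resolve(fallback);
      },
    );
  });
}
```

With the `retrieve` function from the RAG section, the call site would be something like `const docs = await withDeadline(retrieve(userMsg), 150, [])`. The 150ms budget is an assumption; pick it from your own first-token target.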

Cache embeddings. Same prompt twice in a session, same embedding. Cache aggressively, evict by LRU.

Pre-warm the context. When the user opens the copilot panel, fire off a tiny "warm up" request that prefetches their long-term memory and recent docs. By the time they finish typing, the retrieval is already done.

Security Boundaries

A copilot sees more of your app than any other feature. The blast radius if it leaks is large. Five rules.

Filter PII at the boundary. Before serialising context into a prompt, strip emails, phone numbers, internal IDs you don't want logged.

function sanitise(data: unknown): unknown {
  if (typeof data === "string") {
    return data
      .replace(/\b[\w.-]+@[\w-]+\.[\w.-]+\b/g, "[email]")
      .replace(/\b\d{10,}\b/g, "[number]");
  }
  if (Array.isArray(data)) return data.map(sanitise);
  if (data && typeof data === "object") {
    return Object.fromEntries(
      Object.entries(data).map(([k, v]) => [k, sanitise(v)]),
    );
  }
  return data;
}

Authenticate every tool call server-side. The tool dispatcher in the client says "trust me, the user wants to delete this." The server doesn't trust the client. It re-checks who the user is and whether they have permission for this action.
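On the server, that re-check can be a plain function that runs before any tool handler. The sketch below is an illustration only: the roles, the `REQUIRED_ROLE` policy table, and `authorizeToolCall` are all hypothetical, and `delete_issue` is an example tool that doesn't appear elsewhere in this post.

```typescript
type Role = "viewer" | "member" | "admin";
type Session = { userId: string; role: Role } | null;
type AuthResult = { ok: true } | { ok: false; status: 401 | 403; error: string };

// Hypothetical policy table: the minimum role each tool requires.
const REQUIRED_ROLE: Record<string, Role> = {
  filter_issues: "viewer",
  create_issue: "member",
  delete_issue: "admin",
};
const RANK: Record<Role, number> = { viewer: 0, member: 1, admin: 2 };

export function authorizeToolCall(session: Session, toolName: string): AuthResult {
  // The session must come from the server's own auth layer, never from the request body.
  if (!session) return { ok: false, status: 401, error: "not signed in" };

  const required = Object.prototype.hasOwnProperty.call(REQUIRED_ROLE, toolName)
    ? REQUIRED_ROLE[toolName]
    : undefined;
  // Tools without an explicit policy are denied rather than allowed.
  if (!required) return { ok: false, status: 403, error: `no policy for ${toolName}` };

  if (RANK[session.role] < RANK[required]) {
    return { ok: false, status: 403, error: `${toolName} requires ${required}` };
  }
  return { ok: true };
}
```

Denying unknown tools by default means a tool added on the client without a matching server policy fails closed.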

Scope tools by route. A tool that's safe in /settings is dangerous on /admin. Wire the available tool set to the current route, not globally.
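Route scoping can start as a lookup from path prefix to allowed tool names, with unknown routes getting nothing. The mapping below is an assumed example: the prefixes and `update_preference` are placeholders, and `toolsForRoute` is a hypothetical helper.

```typescript
// Hypothetical route-to-tool mapping; prefixes and tool names are examples.
const TOOLS_BY_ROUTE: Array<{ prefix: string; tools: string[] }> = [
  { prefix: "/issues", tools: ["filter_issues", "create_issue"] },
  { prefix: "/settings", tools: ["update_preference"] },
  { prefix: "/admin", tools: [] }, // no copilot tools on admin pages
];

export function toolsForRoute(route: string): string[] {
  // Match whole path segments so "/issues-archive" does not match "/issues".
  const match = TOOLS_BY_ROUTE.find(
    ({ prefix }) => route === prefix || route.startsWith(prefix + "/"),
  );
  // Unknown routes get no tools: deny by default.
  return match ? [...match.tools] : [];
}
```

The returned names decide both which tool definitions are sent to the model and which names the dispatcher will accept for that turn.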

Audit-log every tool dispatch. Who ran what, with what args, with what result. Store it. When something looks weird, you'll want this.

Treat model output as untrusted input. If the model generates markdown, run it through a sanitiser. If it generates code, sandbox the preview. If it generates URLs, check the origin before navigating.
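For the URL case, the check can be a small allowlist over parsed origins. This is a sketch under assumptions: `isAllowedUrl` is a hypothetical helper, and the allowlist and base URL come from your own configuration.

```typescript
// Returns true only for http(s) URLs whose origin is on the allowlist.
export function isAllowedUrl(raw: string, allowedOrigins: string[], base: string): boolean {
  let url: URL;
  try {
    // `base` resolves relative links such as "/issues/42".
    url = new URL(raw, base);
  } catch {
    return false; // not parseable: refuse
  }
  // Rejects javascript:, data:, and other non-web schemes.
  if (url.protocol !== "https:" && url.protocol !== "http:") return false;
  // Compare the parsed origin, not a string prefix, so lookalike hosts fail.
  return allowedOrigins.indexOf(url.origin) !== -1;
}
```

Comparing `url.origin` rather than checking `raw.startsWith(...)` is the point of the design: a prefix check would accept a host like `app.example.com.evil.net`.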

The general principle: nothing the model produces gets to act on the system without passing through a layer your code controls. The model proposes. Your code commits.

Streaming Benchmarks

A few real numbers from production-style copilot setups. Targets, not promises. Your mileage varies with model, hosting, and prompt size.

| Metric | Target | What good looks like |
| --- | --- | --- |
| First-token latency | < 800ms | 400-600ms (streamed from edge, prompt under 4k tokens) |
| Stream cadence | > 30 tok/s | 50-80 tok/s for Sonnet-class models |
| Tool dispatch latency | < 200ms | 50-150ms for client-side tools, 200-500ms for API tools |
| Total response (chat) | < 8s | 3-6s for a typical 500-token answer |
| Context window used | < 50% | 20-40% for a well-pruned session |
| Tokens per chat turn | < 3,000 | 1,500-2,500 with RAG and trimmed history |

If first-token latency is above 1.5s, your prompt is too big or your provider is in a far region. If stream cadence is below 20 tok/s, you're either on a small instance or you're CPU-bound on the client (often markdown re-parsing).
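To see where your own setup lands on the first two rows, you can wrap the stream in a small measuring helper. The sketch below assumes the stream is exposed as an async iterable of text chunks and counts each non-empty chunk as one token, which is only an approximation. `measureStream` and the injectable `now` clock are illustrative.

```typescript
type StreamStats = {
  firstTokenMs: number | null;
  tokens: number;
  tokensPerSecond: number;
};

// Consumes an async iterable of text chunks and reports latency and cadence.
export async function measureStream(
  chunks: AsyncIterable<string>,
  now: () => number = () => Date.now(),
): Promise<StreamStats> {
  const start = now();
  let firstTokenAt: number | null = null;
  let tokens = 0;

  for await (const chunk of chunks) {
    if (chunk.length === 0) continue; // ignore empty keep-alive chunks
    if (firstTokenAt === null) firstTokenAt = now();
    tokens += 1;
  }

  const end = now();
  // Cadence is measured from the first token, so time-to-first-token is excluded.
  const streamingSeconds = firstTokenAt === null ? 0 : (end - firstTokenAt) / 1000;
  return {
    firstTokenMs: firstTokenAt === null ? null : firstTokenAt - start,
    tokens,
    tokensPerSecond: streamingSeconds > 0 ? tokens / streamingSeconds : 0,
  };
}
```

Measured in the browser, this includes network time, which is the number the user actually feels.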

Should You Build a Copilot?

Depends on whether you have a problem that benefits from one.

Yes: information-dense apps where users get stuck on "how do I X." Dashboards, project management, code editors, design tools, anything with thousands of small operations the user has to learn.

Maybe: transactional apps where AI can draft (emails, support replies, summaries). The win is real but bounded.

No: apps where speed is the value. Adding 2 seconds of model latency to a sub-second flow is a regression.

The honest test: would a power user click through faster than typing a prompt? If yes, the copilot is in the way. If no, it earns its place.

Wrapping Up

A copilot isn't a chatbot with extra steps. It's a stateful, contextual, tool-using sidekick that integrates with your app. Six pieces have to be right:

  1. UI that fits where the user needs it (sidebar, palette, or inline).
  2. Context captured continuously and snapshotted at submit time.
  3. Memory in tiers (working, session, long-term) with explicit budgets.
  4. Streaming that coalesces tokens, memoises markdown, and respects the scroll position.
  5. Tools with schema validation, destructive-action gating, and audit logging.
  6. Retrieval over docs and app state when context exceeds the window.

Get those six right and you ship a copilot that feels like a colleague, not a chatbot. Get any of them wrong and users will tell you: "it doesn't really know what I'm doing."

That feedback, more than any benchmark, is the one that matters.
