Building an AI Coding Agent for a Design System with Claude Agent SDK

I spent a few weeks earlier this year building a small AI agent that helps me complete my React design system. It's a CLI tool I can run from the terminal, plus an in-app chatbox that floats over the desktop product I work on. I called it AI Polish.

It's 589 lines of TypeScript. That number surprised me. I expected to write a few thousand lines, with bespoke prompt construction, custom tool dispatchers, retry logic, and a streaming protocol. Instead, the Claude Agent SDK shouldered most of that, and what was left for me to write was the boring-but-critical work: the system prompt, the safety boundaries, and the diff-preview UX.

This post is a writeup of the architecture, the decisions that survived, and the patterns I'd reach for again. If you're thinking about building an LLM coding agent for a specific domain (a design system, a codebase migration, an internal DSL), the shape below is a good starting point.

Table of contents

Prerequisites
What AI Polish does
The stack
Two surfaces, one core
The architecture
A stable system prompt for caching
In-process MCP tools
The path allowlist safety gate
Diff preview via bidirectional stdin
Event taxonomy for CLI and UI parity
What I learned
Wrapping up

Prerequisites

This article assumes:

Working knowledge of TypeScript and Node.js. You should be able to read async code and child_process interactions.
Familiarity with React and design systems. You don't need a deep design background, but the article is about a coding agent that builds design system components.
Basic understanding of LLM agents. The agent loop pattern (model decides to call a tool, tool runs, result goes back to model, repeat). You don't need to have built one yourself.
A Claude API key, if you want to follow along. Anthropic's SDK is what this article is built on. Alternative SDKs (OpenAI, Vercel AI SDK) are similar enough that the patterns transfer.

You don't need any prior experience with the Claude Agent SDK specifically. I'll cover the pieces that matter.

What AI Polish does

The product context: I work on a React design system. Primitives (Button, Input, Card), composite components (FormRow, StatCard), tokens (colors, spacing, typography), and a small army of conventions about how those compose. Maintaining the system involves a steady drumbeat of work: adding a new primitive when the design team asks for one, updating a token after a theme refresh, polishing component variants, writing the docs page that goes with each one.

A lot of that work is mechanical. The shape of a new primitive component is dictated by the system: a folder with index.tsx and an optional componentNameConfig.ts for variant styles, re-exported from the package index, styled with token references rather than hardcoded values, with a light and dark variant. Once you've written ten of them, you've written all of them, modulo the actual design.

AI Polish is an agent that automates that mechanical work. You give it a task ("add a Tooltip primitive matching the Figma frame X"), it reads the existing patterns in the codebase, drafts the component, shows you a diff, and writes the file on approval. On a typical task it touches three to five files and finishes in 30 to 90 seconds.

There are two ways to invoke it:

CLI: npm run ai -- "add a Tooltip primitive matching tooltip.tsx in Figma". Output streams to the terminal. Good for batch work and scripting.
In-app drawer: Cmd+K opens a drawer inside the desktop app. The drawer shows the current screen's context, sends it as a prompt suffix, and renders diffs as cards you click to Apply or Reject. Good for when you're already in the app and want to ask about what you're looking at.

Both surfaces invoke the same underlying CLI. The drawer just spawns it as a child process per task and renders the streamed events.

The stack

Four dependencies in production:

"@anthropic-ai/claude-agent-sdk": "^0.3.0",
"dotenv": "^16.4.5",
"js-yaml": "^4.1.0",
"zod": "^4.0.0"

The Claude Agent SDK is the load-bearing one. It's not the standard Anthropic SDK; it's a higher-level kit that ships built-in file tools (Read, Write, Edit, Glob, Grep), prompt caching support, an in-process MCP server factory, and an agent mode that gives the model the freedom to plan and call tools iteratively. The standard SDK gives you client.messages.create(). The Agent SDK gives you a working agent.

If you're tempted to write a custom agent loop on top of the standard SDK: I tried that first. It works, but you'll end up re-implementing 80% of what the Agent SDK gives you. The build-versus-buy line on this particular abstraction has moved.

Dev dependencies are minimal too:

"typescript": "^5.9.2",
"tsx": "^4.0.0",
"@types/node": "^22.0.0",
"@types/js-yaml": "^4.0.9"

No build step. The agent runs directly via tsx src/cli.ts. This is one of those small details that compounds: every time I changed the agent, I could npm run ai immediately, no build wait, no source map generation, no esbuild config.

Total size: 589 lines of source code, plus an 87-line system prompt. The whole codebase is small enough to read in one sitting.

Two surfaces, one core

The architectural choice that made the rest of the design fall into place was: the CLI is the only entry point. The in-app drawer is a renderer for the CLI's streamed events.

Two surfaces, one core diagram

The CLI accepts a task string, configures the Agent SDK, and streams events to stdout (and via IPC to the drawer if it's been spawned that way). Both surfaces consume the same event taxonomy and render it differently: ANSI text in the terminal, React cards in the drawer.

This decision had three downstream benefits:

One code path, one set of bugs. Whatever works in the terminal works in the drawer, and vice versa. There's no parallel implementation to keep in sync.

Testing the agent is testing the CLI. No need to mock the IPC layer; the CLI's stdout is the source of truth.

The CLI is the killer feature for power users. Once the CLI exists, it can be scripted, scheduled, piped into other tools. Some of my heaviest usage has been kicking off a batch of tasks from a shell loop.

The cost: the drawer pays a ~1 second spawn latency per task (tsx startup plus SDK init). For design system work this is invisible because the task itself takes 30 to 90 seconds. For sub-second interactions it would matter; that's a future optimization (roadmap item: long-running daemon mode).

The architecture

The full picture, end to end:

The architecture diagram

The five interesting pieces are: the system prompt, the custom Figma tools, the safety gate, the diff preview, and the event stream. Each gets its own section below.

A stable system prompt for caching

The system prompt is the highest-leverage 87 lines in the entire codebase. It's also the most boring to write and the easiest to get wrong.

What it covers:

The high-level goal (one sentence): help complete the design system library itself.
A stack snapshot (React 19, TypeScript 5.9, Turborepo, Electron).
Where things live (directory tour with ~8 key locations).
Import conventions (the exact internal package paths).
Component file pattern (folder structure, naming).
Theming rules (never hardcode colors, always use the theme hook, support light AND dark).
Typography rules (one font family, specific sizes and weights).
TypeScript rules (strict, no any, named exports).
Style approach (inline sx prop, variants in a Config.ts file).
The expected workflow (plan, read existing patterns, make focused edits, verify themes, wrap up).
Available tools (built-in plus custom Figma tools).
Hard guardrails (write allowlist, no shell commands, no new dependencies).

What it does not contain: few-shot examples. The prompt relies on the codebase itself as the example set. The agent uses Glob and Grep to find existing primitives that look like what the user is asking for, reads them, and uses them as the implicit template.

The reason this prompt is kept stable: the Claude Agent SDK supports prompt caching. When the system prompt is identical across requests, the SDK marks it as cacheable and the API serves it from cache at ~10% of the normal input token cost. For a tool that runs many times a day, this is the difference between a $50/month bill and a $5/month one.
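To make the cost claim concrete, here is a rough back-of-the-envelope sketch. It assumes the multipliers Anthropic publishes for prompt caching as I understand them (writing the cache costs about 1.25x the normal input price, reading it about 0.1x) and that every run arrives while the cache entry is still alive. The function name and the run counts are illustrative and not part of AI Polish.

```typescript
// Relative cost of sending the same cacheable prefix `runs` times,
// compared with sending it uncached every time (1.0 means no saving).
// Assumed multipliers: the first run writes the cache, later runs read it.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

function cachedPrefixCostRatio(runs: number): number {
  if (!Number.isInteger(runs) || runs < 1) {
    throw new RangeError("runs must be a positive integer");
  }
  const cachedCost = CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (runs - 1);
  const uncachedCost = runs; // one full-price prefix per run
  return cachedCost / uncachedCost;
}
```

A single run is slightly more expensive than no caching at all (ratio 1.25). The ratio drops to roughly 0.2 after ten runs and approaches 0.1 as the run count grows. Only the stable prefix gets this discount; the task text and the output tokens are billed as usual.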

// src/prompt.ts (extract)
export const systemPrompt = `You are a senior frontend engineer working on a React + Electron design system.

YOUR GOAL: complete the design system library itself. Primitives, composite components, tokens, themes, docs.

WHERE THINGS LIVE:
- packages/ui/primitives/ - base components
- packages/ui/comps/ - composite components
- packages/ui/styles/ - tokens and theme provider
- packages/ui/hooks/ - shared React hooks
- packages/ui/plugins/ - heavier components (Canvas, Charts, Tables)

COMPONENT FILE PATTERN:
- One folder per component: ComponentName/
- index.tsx for the implementation
- componentNameConfig.ts for variant styles (when variants exist)
- Re-export from the package index.ts

THEMING (non-negotiable):
- Never hardcode colors. Always use useTheme() and reference tokens.
- Every component must support light AND dark themes.
- Tokens are defined as { light, dark } pairs.
- Common tokens: primaryFill, surfaceFill, text, statusSuccess, statusError, stroke.

... (continues)
`;

The prompt being a constant .ts export, version-controlled, means I can tweak it once, observe the agent's behavior over the next few runs, and commit changes confidently. The agent never gets a non-deterministic prompt.

A practical lesson: write the prompt as the codebase itself would explain its conventions, not as a generic LLM tutorial. The agent treats the prompt as authoritative context. If it says "use the sx prop," the agent uses the sx prop. If it's vague ("style components nicely"), the agent picks something plausible-but-wrong.

In-process MCP tools

Beyond the built-in file tools (Read, Write, Edit, Glob, Grep), AI Polish has three custom tools for Figma access:

list_figma_dumps: list the pre-cached Figma frames available on disk.
read_figma_dump: read one of those cached frames as YAML.
figma_get_node_by_url: optional live fetch from Figma's REST API (requires a FIGMA_ACCESS_TOKEN).

These are MCP (Model Context Protocol) tools, but they don't run over a network. They run in-process as TypeScript functions, served by the Agent SDK's createSdkMcpServer() helper.

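// src/tools/figma.ts (extract)
import { promises as fs } from "node:fs";
import path from "node:path";
import { createSdkMcpServer, tool } from "@anthropic-ai/claude-agent-sdk";
import yaml from "js-yaml";
import { z } from "zod";

// Assumed location of the cached dumps; it mirrors the entry in the read allowlist.
const FIGMA_DUMPS_DIR = path.resolve("apps/e2e-tests/design-diff/figma-dumps");

export const figmaTools = createSdkMcpServer({
  name: "figma",
  version: "0.1.0",
  tools: [
    // tool(name, description, zod input shape, handler)
    tool(
      "list_figma_dumps",
      "List pre-cached Figma frame dumps available on disk.",
      {},
      async () => {
        const files = await fs.readdir(FIGMA_DUMPS_DIR);
        const dumps = files.filter((f) => f.endsWith(".yaml"));
        // MCP tool results are content blocks, so the list goes back as JSON text.
        return { content: [{ type: "text", text: JSON.stringify({ dumps }) }] };
      },
    ),
    tool(
      "read_figma_dump",
      "Read a cached Figma frame as YAML.",
      { filename: z.string() },
      async ({ filename }) => {
        // basename() keeps the model from reading outside the dumps directory.
        const raw = await fs.readFile(
          path.join(FIGMA_DUMPS_DIR, path.basename(filename)),
          "utf8",
        );
        return { content: [{ type: "text", text: JSON.stringify(yaml.load(raw)) }] };
      },
    ),
    // ... figma_get_node_by_url
  ],
});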

The Agent SDK accepts this as a tool source alongside its built-in tools. The model calls them the same way it calls Read or Glob.
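As a rough illustration of that wiring, the sketch below builds the kind of options object the SDK's query() call takes. It uses local stand-in types instead of importing the SDK, so treat the option names (systemPrompt, mcpServers, allowedTools) and the helper names as assumptions to check against the SDK version you install. The mcp__server__tool naming follows the convention the SDK uses for MCP tools.

```typescript
// Local stand-in for the SDK's options type; only the fields used here.
type AgentOptionsSketch = {
  systemPrompt: string;
  mcpServers: Record<string, unknown>;
  allowedTools: string[];
};

// Tools served by an MCP server are addressed as mcp__<server>__<tool>.
function mcpToolName(server: string, toolName: string): string {
  return `mcp__${server}__${toolName}`;
}

const BUILT_IN_TOOLS = ["Read", "Write", "Edit", "Glob", "Grep"];
const FIGMA_TOOL_NAMES = ["list_figma_dumps", "read_figma_dump", "figma_get_node_by_url"];

// Hypothetical helper: combines the prompt, the in-process server, and the tool list.
function buildAgentOptions(systemPrompt: string, figmaServer: unknown): AgentOptionsSketch {
  return {
    systemPrompt,
    mcpServers: { figma: figmaServer },
    allowedTools: [
      ...BUILT_IN_TOOLS,
      ...FIGMA_TOOL_NAMES.map((t) => mcpToolName("figma", t)),
    ],
  };
}
```

Listing the tools explicitly keeps the agent's surface area visible in one place: five built-in file tools plus the three Figma tools, and nothing else.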

This pattern is worth highlighting because the obvious alternative is heavier than it needs to be. MCP servers usually run as separate processes (or remote services) that speak JSON-RPC over stdio or HTTP. That's the right shape when you want one MCP server to be reusable across different agents (the Slack MCP server, the Linear MCP server). For an agent's domain-specific tools, the in-process variant is a much better fit:

No HTTP, no port management, no separate process to keep running.
Zero network latency. Tool calls dispatch to a TypeScript function in the same memory space.
Easier debugging. The whole stack trace is one process.
Easier deployment. The agent ships as one package, with no MCP server to ship alongside.

For a domain-specific agent, in-process MCP is the default I'd reach for now.

The path allowlist safety gate

The most important safety property of a coding agent is that it can't write to places it shouldn't. AI Polish enforces this with a hard allowlist in safety.ts:

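// src/safety.ts (extract)
import path from "node:path";
import { minimatch } from "minimatch";

const WRITE_ALLOWED = [
  "packages/ui/**",
  "apps/docs/content/**",
];

const READ_ALLOWED = [
  "packages/ui/**",
  "apps/docs/content/**",
  "apps/e2e-tests/design-diff/figma-dumps/**",
  "CLAUDE.md",
];

// Assumption: the agent is launched from the repository root.
const REPO_ROOT = process.cwd();

export function isWriteAllowed(filePath: string): boolean {
  // Resolve first so "../" segments and absolute paths are judged by where they point.
  const relative = path
    .relative(REPO_ROOT, path.resolve(REPO_ROOT, filePath))
    .split(path.sep)
    .join("/");
  // Reject anything that lands outside the repository.
  if (relative === ".." || relative.startsWith("../") || path.isAbsolute(relative)) {
    return false;
  }
  return WRITE_ALLOWED.some((pattern) => minimatch(relative, pattern));
}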

The allowlist is enforced via the Agent SDK's PreToolUse hook. Before any Write or Edit tool call executes, the hook runs. If the target path isn't in the allowlist, the call is denied with a structured reason, and the model receives that as the tool result.
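A minimal sketch of what such a hook can look like follows. The input and output shapes are modeled on the SDK's PreToolUse hook as I understand it, but they are declared locally here, so treat the field names as assumptions. The real hook is an async callback registered in the SDK options; this version is synchronous and uses a simplified prefix check in place of isWriteAllowed so that it stays self-contained.

```typescript
// Local stand-ins for the hook input and output shapes.
type PreToolUseInput = {
  hook_event_name: "PreToolUse";
  tool_name: string;
  tool_input: { file_path?: string };
};

type PreToolUseOutput = {
  hookSpecificOutput?: {
    hookEventName: "PreToolUse";
    permissionDecision: "allow" | "deny";
    permissionDecisionReason?: string;
  };
};

const GUARDED_TOOLS = new Set(["Write", "Edit"]);

// Simplified stand-in for isWriteAllowed: prefix checks instead of glob matching.
function isWriteAllowedSketch(filePath: string): boolean {
  if (filePath.split("/").includes("..")) return false;
  return ["packages/ui/", "apps/docs/content/"].some((prefix) => filePath.startsWith(prefix));
}

// Hypothetical hook body: returns {} to stay out of the way, or a structured denial.
function guardWrites(input: PreToolUseInput): PreToolUseOutput {
  if (!GUARDED_TOOLS.has(input.tool_name)) return {}; // not a write, no opinion
  const target = input.tool_input.file_path ?? "";
  if (isWriteAllowedSketch(target)) return {}; // inside the allowlist
  return {
    hookSpecificOutput: {
      hookEventName: "PreToolUse",
      permissionDecision: "deny",
      permissionDecisionReason: `Write to "${target}" is outside the allowlist.`,
    },
  };
}
```

The denial reason is worth writing carefully, because it is the text the model reads before deciding what to try next.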

The path allowlist safety gate diagram

Two properties of this pattern that matter:

It fails fast. A denied write returns to the model as a tool result with a reason. The model reads it and adapts (tries a different path, asks for clarification, etc.). This is much better than silently succeeding into a wrong directory.

It composes with the diff preview. Even when a write is allowed, the user still sees the diff before the file changes. The allowlist is the floor of safety, not the ceiling.

This is the pattern I'd recommend for any code-writing agent: enforce safety boundaries at tool-use time, not at result time, and let the model see and react to denials.

Diff preview via bidirectional stdin

The diff preview is the UX feature that makes AI Polish feel safe to use. On every Write and Edit, the agent pauses, the in-app drawer renders a diff card, and the user clicks Apply or Reject.

The technical challenge: how does the agent (a child process) ask the drawer (the parent process) a question and wait for the answer?

The obvious approach is some kind of two-way IPC: HTTP, WebSocket, or a named pipe. I tried that first and found it overengineered.

The approach that actually shipped uses stdin and stdout, in both directions:

The CLI emits one-way events on stdout, prefixed with a sigil (>>AIE):

>>AIE {"type":"write_pending","id":"abc","filePath":"...","content":"..."}

The Electron main process parses these events and broadcasts to the renderer.
When the user clicks Apply or Reject in the drawer, Electron main writes one line to the CLI's stdin, prefixed with another sigil (<<DEC):
```
<<DEC {"id":"abc","decision":"apply"}
```
The CLI buffers stdin, looks for lines matching <<DEC, and resolves the corresponding promise.

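// agent.ts (extract)
type Decision = "apply" | "reject";
const pendingDecisions = new Map<string, (decision: Decision) => void>();

function waitForDecision(id: string): Promise<Decision> {
  return new Promise((resolve) => {
    pendingDecisions.set(id, resolve);
  });
}

// stdin listener: a chunk can end mid-line, so the unfinished tail stays in a buffer
let stdinBuffer = "";
process.stdin.on("data", (chunk) => {
  stdinBuffer += chunk.toString();
  const lines = stdinBuffer.split("\n");
  stdinBuffer = lines.pop() ?? "";
  for (const line of lines) {
    if (!line.startsWith("<<DEC ")) continue;
    let msg: { id: string; decision: Decision };
    try {
      msg = JSON.parse(line.slice(6));
    } catch {
      continue; // ignore malformed decision lines
    }
    const resolver = pendingDecisions.get(msg.id);
    if (resolver) {
      resolver(msg.decision);
      pendingDecisions.delete(msg.id);
    }
  }
});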

This pattern is small, fast, and has zero dependencies. No HTTP server, no port to negotiate. The same channel that streams events also delivers decisions.
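The other direction of the protocol can be sketched the same way. The helpers below are hypothetical (the names encodeEvent and createEventParser are mine, not from the AI Polish source): one formats an event as a single sigil-prefixed line on the CLI side, the other is what a parent process could use to turn raw stdout chunks back into events.

```typescript
const EVENT_SIGIL = ">>AIE ";

type WirePendingWrite = {
  type: "write_pending";
  id: string;
  filePath: string;
  content: string;
};

// CLI side: one event per line. JSON.stringify escapes newlines inside
// the payload, so file content never breaks the line framing.
function encodeEvent(event: WirePendingWrite): string {
  return EVENT_SIGIL + JSON.stringify(event) + "\n";
}

// Parent side: feed raw stdout chunks in, get complete events out.
function createEventParser(onEvent: (event: unknown) => void): (chunk: string) => void {
  let buffer = "";
  return (chunk) => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the unfinished tail for the next chunk
    for (const line of lines) {
      if (!line.startsWith(EVENT_SIGIL)) continue; // ordinary output, not an event
      let parsed: unknown;
      try {
        parsed = JSON.parse(line.slice(EVENT_SIGIL.length));
      } catch {
        continue; // malformed payload: skip the line rather than crash
      }
      onEvent(parsed);
    }
  };
}
```

Because unprefixed lines are skipped, anything else the CLI prints (logs, warnings from dependencies) passes through without confusing the parent.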

The CLI auto-allows writes when it detects it's running in a TTY (an interactive terminal). The bidirectional decision flow only kicks in when the CLI's stdin is piped, which is exactly when it's spawned by Electron. This means the CLI works correctly in both contexts without any flags.
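The check itself is tiny. A sketch, assuming the decision hinges on Node's stdin.isTTY flag (the function and type names here are hypothetical):

```typescript
type DecisionMode = "auto-apply" | "ask-parent";

// In Node, stdin.isTTY is true on an interactive terminal and
// undefined when stdin is a pipe (for example when spawned by Electron).
function decisionModeFor(stdin: { isTTY?: boolean }): DecisionMode {
  return stdin.isTTY ? "auto-apply" : "ask-parent";
}
```

In the CLI this would be called as decisionModeFor(process.stdin). Taking the stream as a parameter keeps the function easy to test without a real terminal.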

If you're building a similar agent and need a control channel for diff previews or confirmations, this is the simplest approach that works.

Event taxonomy for CLI and UI parity

To make the CLI and the drawer behaviorally identical, both surfaces consume the same set of events. Defining that set up front saved me a lot of cross-surface drift later.

type AgentEvent =
  | { type: "iteration_start"; iteration: number }
  | { type: "text"; content: string }
  | { type: "thinking"; content: string }
  | { type: "tool_use"; name: string; input: unknown }
  | { type: "tool_result"; name: string; ok: boolean; preview?: string }
  | { type: "write_pending"; id: string; filePath: string; content: string }
  | { type: "write_resolved"; id: string; decision: "apply" | "reject" }
  | { type: "iteration_end"; usage: TokenUsage }
  | { type: "done"; totalCost: number; durationMs: number }
  | { type: "error"; message: string };

The agent loop emits these as it streams. Both surfaces consume them:

The CLI prints them as ANSI-colored text (dim for headers, blue for tool calls, magenta for thinking).
The drawer renders them as React components (a status bar, a diff card, a thinking-collapsible).

Having a single union type for these events means both surfaces can be checked for exhaustiveness. If each renderer switches on event.type and passes the default case to a function that takes never, TypeScript refuses to compile when I add a new event type and forget to handle it.
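Here is a minimal sketch of that exhaustiveness check, using a trimmed-down copy of the union. The names SketchEvent, assertNever, and renderForTerminal are illustrative, and the ANSI codes are just examples.

```typescript
// A reduced version of the event union, enough to show the pattern.
type SketchEvent =
  | { type: "text"; content: string }
  | { type: "tool_use"; name: string; input: unknown }
  | { type: "error"; message: string };

// Only reachable if a union member is left unhandled. Because the
// parameter is typed never, an unhandled member is a compile error.
function assertNever(value: never): never {
  throw new Error(`Unhandled event: ${JSON.stringify(value)}`);
}

// Terminal-style renderer (34 = blue, 31 = red, 0 = reset).
function renderForTerminal(event: SketchEvent): string {
  switch (event.type) {
    case "text":
      return event.content;
    case "tool_use":
      return `\x1b[34m${event.name}\x1b[0m`;
    case "error":
      return `\x1b[31m${event.message}\x1b[0m`;
    default:
      return assertNever(event);
  }
}
```

Adding a fourth member to SketchEvent without a matching case makes the assertNever call fail type checking, which is the signal to update every renderer that consumes the union.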

The lesson here is broader than this project: when you have two consumers of the same logical stream (CLI plus UI, server plus client, main plus worker), define the message type as a discriminated union in one place and import it from both consumers. The discriminated union is your contract.

What I learned

A few takeaways from the project:

The Agent SDK is the right abstraction now. Six months ago I would have written a custom agent loop. Today I wouldn't. The SDK handles prompt caching, tool dispatch, streaming, and structured outputs well enough that custom code is mostly noise. The only piece I genuinely needed to write was the safety gate, which the SDK exposes as a hook.

Prompt caching changes the cost math. Without caching, design system tasks were costing 5 to 15 cents each. With caching, they're 1 to 3 cents. Multiply by hundreds of tasks per month and the savings are real. The trick is keeping the system prompt stable so the cache hits.

In-process MCP beats network MCP for domain agents. I started by considering a Figma MCP server that ran as a separate process. It would have been the right answer if I wanted other agents to use the same tools. For one agent with its own domain, the in-process variant was simpler, faster, and easier to debug.

Bidirectional stdin is a perfectly good control channel. I almost reached for WebSockets. Glad I didn't. The sigil-prefixed line protocol is 30 lines of code on each side, has no dependencies, and the entire flow is debuggable by piping to cat.

Diff preview is the feature that makes the tool trustable. Without it, users (including me) hesitate to run the agent on real code. With it, the agent feels like a junior engineer pairing with you: it proposes the change, you review, you approve. The flow becomes muscle memory.

Path allowlist is the floor of safety. Diff preview catches what makes it through; the allowlist makes sure the destination is sensible in the first place. Both layers matter.

Stable prompts beat clever prompts. Every time I tried to be clever with the system prompt (giving the model framework-specific tips, anti-patterns to avoid, edge case warnings), the result got worse. The simplest, most factual prompt about the codebase conventions was the best.

Don't try to be Cursor. The temptation is to expand scope: support arbitrary edits across the whole codebase, persistent multi-turn conversation, autonomous PR generation. None of these belong in a domain-specific agent. AI Polish stayed narrow on purpose: design system files, single-task invocations, diff preview every time. The narrowness is the feature.

Wrapping up

The pattern that survived: a small CLI that wraps the Claude Agent SDK with a stable system prompt, a few custom tools, a path allowlist enforced at tool-use time, a bidirectional stdin channel for diff previews, and a discriminated union of events that both the CLI and the in-app drawer consume.

589 lines of TypeScript. Four runtime dependencies. No build step. Cents per task with prompt caching. Real productivity wins for the slow, mechanical parts of design system work.

If you're considering building a domain-specific coding agent, the framework here is the one I'd recommend starting with. The Agent SDK does the heavy lifting. The interesting work is in the prompt, the safety boundaries, and the UX around what the agent proposes.

The narrowness is the feature. Pick one job. Make the agent excellent at it. Resist the urge to be Cursor.

#ai #llm #reactjs #design-systems #electron
