How to Test Your App from Claude Code with MCP

Aryan · June 2, 2026 · 5 min read

The Model Context Protocol lets Claude Code call external tools. The interesting move is connecting it to a UX testing tool — so the same model that wrote your code can test it without leaving the chat. This guide covers what MCP is, what setup looks like, what changes day-to-day, and where it falls short today.

What is MCP and why does this matter?

MCP is an open protocol from Anthropic that lets AI assistants call external tools and read external data through a standard interface. AI coding tools like Claude Code, Cursor, and Codex CLI already support it.

The shift it enables is that the AI editor stops being a code generator and becomes a feedback loop. It writes the code, runs it, observes the result, and decides what to do next. Until recently, "observes the result" mostly meant reading logs. With MCP, it can also mean running a real usability test against the dev server it just helped write.

What does the setup look like?

For Swarm's MCP server, the install is one command from any terminal. It registers Swarm with Claude Code and Codex CLI and opens a browser tab to sign you in:

npx @useswarm/mcp@latest setup

After that, restart your editor and tell it to test something. The setup auto-detects both Claude Code and Codex; for Cursor, a small JSON snippet goes in ~/.cursor/mcp.json. Credentials live at ~/.useswarm/config.json, and the MCP server runs locally over stdio.

How does the testing actually work?

When you ask the editor to test your localhost, it calls the MCP server's dev_test tool with a target URL, a goal in plain English, and an audience. The MCP server opens a short-lived Cloudflare tunnel so Swarm's cloud agents can reach your dev server. AI personas matching the audience you described then navigate the flow.
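To make the tool call concrete: an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method tools/call. The sketch below builds one for dev_test. The argument names url, goal, and audience are assumptions inferred from the description above, not Swarm's documented schema, and in practice the editor builds this message for you.

```python
import json


def build_dev_test_request(request_id: int, url: str, goal: str, audience: str) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` message for a tool named dev_test.

    The argument keys (url, goal, audience) are hypothetical; the real
    schema is whatever the server advertises in response to `tools/list`.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "dev_test",
            "arguments": {"url": url, "goal": goal, "audience": audience},
        },
    }


request = build_dev_test_request(
    1,
    "http://localhost:3000",
    "complete the signup flow",
    "first-time SaaS users",
)

# The stdio transport carries each message as a single line of JSON.
wire_message = json.dumps(request)
print(wire_message)
```

The useful takeaway is the shape: a tool name plus a flat set of arguments, which is why a one-sentence prompt is enough for the model to fill in the call.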

Results stream back through dev_watch as structured UX findings, each with a severity, the step where it occurred, and a suggested fix. The model can act on them right there. Your source code never leaves your machine; only HTTP traffic to and from your dev server passes through the tunnel.
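Because each finding carries a severity, a step, and a suggested fix, it is easy to triage them in code. Here is a minimal sketch of that idea. The field names, the three severity labels, and the sample findings are all illustrative assumptions, not Swarm's actual output format.

```python
from dataclasses import dataclass

# Hypothetical severity scale; a lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2}


@dataclass(frozen=True)
class Finding:
    severity: str       # expected to be a key of SEVERITY_RANK
    step: str           # where in the flow the issue occurred
    description: str
    suggested_fix: str


def triage(findings: list[Finding]) -> list[Finding]:
    """Order findings most severe first; unknown severities go last.

    sorted() is stable, so findings with equal severity keep their input order.
    """
    return sorted(
        findings,
        key=lambda f: SEVERITY_RANK.get(f.severity, len(SEVERITY_RANK)),
    )


# Made-up sample data for illustration only.
sample = [
    Finding("minor", "landing page", "CTA label is vague", "Rename to 'Create account'"),
    Finding("critical", "signup form", "Submit fails on valid email", "Fix the email regex"),
    Finding("major", "signup form", "Password rules are hidden", "Show rules under the field"),
]

ordered = triage(sample)
for finding in ordered:
    print(f"[{finding.severity}] {finding.step}: {finding.description} -> {finding.suggested_fix}")
```

Sorting by severity first mirrors how the model tends to use the list: fix what blocks the goal, then polish.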

What changes day-to-day?

The loop changes from "write code, reload, check it yourself" to "write code, ask the editor to test it." A typical prompt is something like:

Test localhost:3000 with goal "complete the signup flow". Audience: first-time SaaS users.

Within a few minutes the model has a list of findings: confusing copy, broken validation, missing affordances, drop-off points. It can then propose specific fixes and re-test by issuing the same command after the fix lands. The savings compound on iterative flows like signup, onboarding, and checkout, where small copy changes have outsized impact.
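The fix-and-retest cycle described above amounts to a bounded loop: test, fix the blocking findings, test again, and stop when nothing blocking remains or the run budget is spent. The sketch below simulates it with stand-in callables. run_test, apply_fixes, and the severity labels are hypothetical placeholders, not part of Swarm's MCP server.

```python
from typing import Callable

BLOCKING = {"critical", "major"}  # hypothetical severity labels


def iterate_until_clean(
    run_test: Callable[[], list[str]],
    apply_fixes: Callable[[list[str]], None],
    max_rounds: int = 3,
) -> tuple[int, list[str]]:
    """Run test/fix rounds until a run reports no blocking severities.

    Returns (rounds_used, blocking_severities_from_the_last_run).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    blocking: list[str] = []
    for round_number in range(1, max_rounds + 1):
        severities = run_test()
        blocking = [s for s in severities if s in BLOCKING]
        if not blocking:
            return round_number, []
        apply_fixes(blocking)
    # Budget exhausted: report what the final run still flagged.
    return max_rounds, blocking


# Simulated runs: the first finds a major issue, the second is clean.
simulated_runs = iter([["major", "minor"], ["minor"]])
applied: list[list[str]] = []

rounds, remaining = iterate_until_clean(
    run_test=lambda: next(simulated_runs),
    apply_fixes=applied.append,
)
print(rounds, remaining)
```

Capping the rounds matters when test runs are limited: the loop stops on its own instead of retesting indefinitely.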

What are the limits today?

Three things to watch. First, AI personas are good at logical friction — broken inputs, confusing copy, missing affordances — but worse at aesthetic or emotional judgments. They will not tell you whether your brand feels right.

Second, very stateful flows (multi-step wizards with branching state) are still hit-or-miss. The persona may take a path you didn't intend. Specify the exact audience and goal to reduce drift.

Third, the editor's own context limits how much detail you get back inline. For long sessions, the structured issue list is the durable record — the editor's chat is a glance, not an archive.

How do I try it?

The free tier on Swarm includes 5 lifetime test runs — enough to wire the MCP up, run a real test against your dev server, and decide whether the loop is worth it.

Set up the MCP server and you can have your editor test a real flow in about 3 minutes. After that, the prompts get short: "test my checkout, same audience as last time."