How We Built Our Browser Automation Layer

Aryan · January 30, 2026 · 7 min read


One of the most common questions we get from engineering teams evaluating Swarm is: "How does the browser automation actually work?" This post is a technical deep dive into our infrastructure.

The stack

At the core, Swarm uses Playwright for browser control. But running Playwright on your own infrastructure has real limitations — you need compute, you have to manage browser instance lifecycles, and parallelizing across dozens of sessions is painful. That's where Browserbase comes in.

Browserbase gives us managed browser instances in the cloud. Each test session gets its own isolated browser, with its own cookies, storage, and network context. This means we can run ten personas against your product simultaneously without any interference between sessions.
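The isolation contract can be modelled in a few lines. This is a stub, not the real Browserbase client — in production each session is a remote cloud browser, but the guarantee is the same: every persona gets its own cookie and storage state, and mutating one never leaks into another. `SessionContext` and `createSession` are illustrative names, not Swarm's actual API.

```typescript
// Hypothetical stand-in for an isolated browser session. In production this
// is a remote Browserbase browser; here it is an in-memory stub that models
// the same contract: per-session cookies and storage, nothing shared.
interface SessionContext {
  readonly id: string;
  cookies: Map<string, string>;
  storage: Map<string, string>;
}

function createSession(id: string): SessionContext {
  // Fresh, isolated state per session -- no pooling, no shared jars.
  return { id, cookies: new Map(), storage: new Map() };
}

// One session per persona; in production these run concurrently.
const personas = ["power-user", "first-timer", "skeptic"];
const sessions = personas.map((p) => createSession(p));

// Logging in as one persona leaves the others untouched.
sessions[0].cookies.set("session_token", "abc123");
```

Because nothing is shared between contexts, a persona that logs in, fills a cart, or triggers an error can't contaminate the state another persona is testing against.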

The Stagehand layer

Raw Playwright commands — click this selector, type into this input, wait for this element — are brittle. They break when your UI changes, and they can't handle the ambiguity of real user interactions. A real user doesn't think "click the element with data-testid='submit-button'" — they think "click the big blue button that says Submit."

Stagehand bridges this gap. It's an AI-powered action layer that translates high-level instructions ("fill in the signup form with test data") into concrete Playwright actions. When the UI changes, Stagehand adapts because it's looking at the page the way a user would — visually, semantically, contextually.
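A toy example makes the difference concrete. The text matcher below is only a stand-in for what Stagehand does with an LLM over the rendered page — but it shows why targeting "the button that says Submit" survives a UI refactor that renames a test id, while a hard-coded selector does not. All names here (`UiElement`, `findByVisibleText`) are illustrative.

```typescript
// Simplified model of a rendered page: elements with a role, visible text,
// and an optional test id.
interface UiElement {
  role: string;
  text: string;
  testId?: string;
}

const page: UiElement[] = [
  { role: "button", text: "Cancel", testId: "cancel-button" },
  { role: "button", text: "Submit", testId: "submit-button" },
];

// Brittle: breaks the moment someone renames the test id.
const bySelector = page.find((el) => el.testId === "submit-button");

// Semantic: targets what the user actually sees, so a test-id rename is a
// non-event. Stagehand resolves this with an LLM; a text match stands in here.
function findByVisibleText(
  els: UiElement[],
  role: string,
  text: string,
): UiElement | undefined {
  return els.find(
    (el) => el.role === role && el.text.toLowerCase() === text.toLowerCase(),
  );
}

const byMeaning = findByVisibleText(page, "button", "Submit");
```

Both lookups find the same element today — but only the semantic one keeps working after the markup changes underneath it.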

The screenshot loop

Our CUA (Computer Use Agent) mode works on a screenshot loop: capture the current page state, send it to the LLM for analysis, receive an instruction, execute it through Stagehand, repeat. This gives us the flexibility to handle any web application without pre-written selectors or test scripts.
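The control flow of that loop can be sketched as follows. The real loop screenshots a live browser and calls an LLM; here both are stubbed behind an interface so the skeleton itself is runnable. The `Agent` interface, `Instruction` shape, and step cap are assumptions for illustration, not Swarm's actual internals.

```typescript
// One step of the loop: what the LLM proposes after seeing a screenshot.
type Instruction = { action: "click" | "type" | "done"; target?: string };

interface Agent {
  capture(): string;                        // screenshot of current page state
  analyze(screenshot: string): Instruction; // LLM proposes the next action
  execute(instr: Instruction): void;        // Stagehand carries it out
}

// capture -> analyze -> execute, until the agent reports completion or the
// step budget runs out (a guard against loops that never converge).
function runLoop(agent: Agent, maxSteps = 20): Instruction[] {
  const trace: Instruction[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const shot = agent.capture();
    const instr = agent.analyze(shot);
    trace.push(instr);
    if (instr.action === "done") break;
    agent.execute(instr);
  }
  return trace;
}

// Stub agent: clicks once, then reports completion.
const script: Instruction[] = [
  { action: "click", target: "the blue Submit button" },
  { action: "done" },
];
let cursor = 0;
const stub: Agent = {
  capture: () => `screenshot-${cursor}`,
  analyze: () => script[cursor++],
  execute: () => {},
};
const trace = runLoop(stub);
```

The step budget matters in practice: an agent that keeps retrying a broken flow should fail the run with a useful trace, not spin forever.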

The LLM sees exactly what the user would see — rendered pixels, not DOM trees. This means it catches visual issues that traditional testing misses: overlapping elements, truncated text, contrast problems, layout shifts.

Authentication handling

One of the hardest problems in automated testing is authentication. Your staging environment probably requires login, and you don't want to hardcode credentials in your test scripts.

We solved this with encrypted credential storage. Credentials are encrypted with AES-256-GCM on the API server, stored as ciphertext, and only decrypted in the worker process that's running the actual test. The encryption key never touches the browser. After the test completes, the decrypted credentials are wiped from memory.
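A minimal sketch of that credential path, using Node's built-in `crypto` module: AES-256-GCM with a fresh random IV per record, and the ciphertext stored alongside the IV and auth tag so tampering is detected at decrypt time. The record layout and helper names are illustrative, not Swarm's actual schema.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// What gets persisted: never the plaintext, only these three fields.
interface EncryptedCredential {
  iv: string;         // 12-byte IV, unique per encryption
  ciphertext: string;
  authTag: string;    // GCM tag; any tampering makes decryption throw
}

function encryptCredential(plaintext: string, key: Buffer): EncryptedCredential {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    iv: iv.toString("base64"),
    ciphertext: ciphertext.toString("base64"),
    authTag: cipher.getAuthTag().toString("base64"),
  };
}

function decryptCredential(record: EncryptedCredential, key: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(record.iv, "base64"));
  decipher.setAuthTag(Buffer.from(record.authTag, "base64"));
  return Buffer.concat([
    decipher.update(Buffer.from(record.ciphertext, "base64")),
    decipher.final(), // throws if the auth tag does not verify
  ]).toString("utf8");
}

// The key lives with the worker (in production, a KMS) -- never the browser.
const key = randomBytes(32);
const record = encryptCredential("staging-password", key);
const recovered = decryptCredential(record, key);
```

GCM is the important choice here: it authenticates as well as encrypts, so a flipped bit in the stored ciphertext fails loudly at decrypt time instead of producing silently corrupted credentials.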

We support two auth modes: agent login (the AI persona fills out your login form, just like a real user would) and cookie injection (pre-authenticated cookies are injected into the browser session). Agent login is better for testing the login flow itself; cookie injection is faster when you just need to test authenticated pages.
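For the cookie-injection mode, the decrypted session data has to be shaped into the format Playwright's `BrowserContext.addCookies()` expects before the first navigation. The sketch below assumes a hypothetical `StoredAuth` shape for the decrypted record; the cookie fields themselves match Playwright's documented format.

```typescript
// Hypothetical shape of a decrypted, pre-authenticated session record.
interface StoredAuth {
  name: string;
  value: string;
  expiresAt: number; // unix seconds
}

// Playwright's documented cookie format for BrowserContext.addCookies().
interface PlaywrightCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number;
  httpOnly: boolean;
  secure: boolean;
  sameSite: "Strict" | "Lax" | "None";
}

function toPlaywrightCookies(auth: StoredAuth[], domain: string): PlaywrightCookie[] {
  return auth.map((c) => ({
    name: c.name,
    value: c.value,
    domain,
    path: "/",
    expires: c.expiresAt,
    httpOnly: true, // keep the session token out of page JavaScript
    secure: true,   // only ever sent over HTTPS
    sameSite: "Lax",
  }));
}

const cookies = toPlaywrightCookies(
  [{ name: "session", value: "opaque-token-value", expiresAt: 1_900_000_000 }],
  "staging.example.com",
);
// In the worker: await context.addCookies(cookies) before the first navigation.
```

Because the cookies land in the context before any page loads, the very first request arrives authenticated — which is what makes this mode faster than driving the login form every run.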

What's next

We're currently working on mobile viewport simulation, network throttling for slow-connection testing, and multi-tab support for complex workflows. The browser automation layer is the foundation that everything else builds on, so we're investing heavily in making it faster, more reliable, and more capable.