Webwright is a Microsoft Research project that flips how browser agents work. Instead of keeping one browser session alive and predicting the next click, it hands the model a terminal and a local workspace, then lets it write Python code that launches, inspects, and discards browser sessions. The durable output is not a finished task but a re-runnable script. The browsing history of the agent becomes a single code file.

What it is for
Long-horizon, real-website tasks: comparing flights, scraping listings, filling forms, checking inventory across many pages. The core idea is code-as-action. Most agents (browser-use, Stagehand) treat the live browser session as their state and select one indexed click or type per step. Webwright treats the local workspace (code, screenshots, logs) as the state. The browser is just an environment the agent spawns and throws away.
Why this matters: as models get better at writing and debugging code, the old one-action-at-a-time harness becomes the bottleneck. Date pickers, pagination, and filtering collapse into loops and functions instead of long chains of fragile pixel-level actions. The result is fewer model rounds and less error accumulation on long tasks.
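To make the "pagination collapses into a loop" point concrete, here is a minimal sketch. It is not taken from the Webwright codebase: `collect_all_pages` and `fetch_page` are hypothetical names, and `fetch_page` stands in for whatever browser calls a real script would make to read one page of results.

```python
from typing import Callable


def collect_all_pages(
    fetch_page: Callable[[int], list[str]],
    max_pages: int = 50,
) -> list[str]:
    """Gather items from a paginated listing with one loop.

    fetch_page is a hypothetical callable: given a 1-based page number it
    returns that page's items, or an empty list past the last page. In a
    real run it would wrap browser calls (open the page, read the rows).
    """
    items: list[str] = []
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)
        if not rows:  # an empty page means we ran past the last one
            break
        items.extend(rows)
    return items


if __name__ == "__main__":
    # Stand-in data source so the sketch runs without a browser.
    fake_site = {1: ["flight A", "flight B"], 2: ["flight C"]}
    print(collect_all_pages(lambda n: fake_site.get(n, [])))
```

A step-at-a-time agent would spend at least one model round per "next page" click. Here the whole traversal is a single script the model writes once and can debug from the traceback if it fails.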
How the loop works
Deliberately small: one runner, one model endpoint, one terminal environment. Roughly 1.5K lines total, no multi-agent orchestration, no graph engine, no plugin layer.
- Send context - runner passes the task, workspace state, and recent observations to the model.
- Emit bash - model returns a thinking block plus a shell command, usually a Playwright-backed Python script.
- Return observations - environment runs it and returns terminal output, logs, screenshots, or error tracebacks.
- Refine and finish - loop continues until the agent writes a final script, reruns it in a fresh folder, and passes a self-reflection check.
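The four steps above can be sketched as a short loop. This is an illustration under stated assumptions, not Webwright's actual runner: `Step`, `LoopState`, `run_shell`, and `agent_loop` are hypothetical names, `next_command` stands in for the model call, and returning `None` is an invented convention for "the model is finished". The default of 100 steps borrows the step budget quoted for the benchmarks.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    command: str      # shell command the model asked to run
    observation: str  # what the environment returned for it


@dataclass
class LoopState:
    task: str
    history: list[Step] = field(default_factory=list)


def run_shell(command: str, timeout: int = 60) -> str:
    """Run one shell command and return stdout plus stderr as the observation."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


def agent_loop(
    task: str,
    next_command: Callable[[LoopState], Optional[str]],
    execute: Callable[[str], str] = run_shell,
    max_steps: int = 100,
) -> LoopState:
    """Send context, get a command, run it, record the observation, repeat."""
    state = LoopState(task=task)
    for _ in range(max_steps):
        command = next_command(state)   # send context, model emits a command
        if command is None:             # assumed signal that the model is done
            break
        observation = execute(command)  # environment runs it
        state.history.append(Step(command, observation))  # fed back next round
    return state
```

Passing `execute` as a parameter keeps the sketch testable without a shell: a fake executor can return canned output while the loop logic stays the same.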
The "done" gate is strict: the agent must produce a final_script.py, rerun it clean, save logs and screenshots, and pass a visual self-reflection judgement before completion is accepted.
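A gate like this can be pictured as a simple artifact check. The sketch below is an assumption about how such a check might look, not the project's implementation: the function name `completion_accepted`, the `*.log` and `*.png` patterns, and the directory layout are all hypothetical, and the clean rerun and the visual self-reflection verdict are treated as computed elsewhere and passed in.

```python
from pathlib import Path


def completion_accepted(run_dir: Path, reflection_passed: bool) -> bool:
    """Return True only if every artifact the done gate asks for is present.

    Assumed layout (not taken from the Webwright repo): final_script.py at
    the top of run_dir, plus at least one *.log file and one *.png
    screenshot anywhere under it. reflection_passed is the verdict of the
    visual self-reflection check, produced by a separate step.
    """
    has_script = (run_dir / "final_script.py").is_file()
    has_logs = any(run_dir.rglob("*.log"))
    has_screenshots = any(run_dir.rglob("*.png"))
    return has_script and has_logs and has_screenshots and reflection_passed
```

The point of the design is that "done" is decided by evidence on disk plus an independent judgement, not by the model announcing that it has finished.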
Language and dependencies
Python 3.10+. Almost no framework weight. The only real dependencies are httpx, pydantic, playwright, and typer. Pluggable model backends for OpenAI, Anthropic, and OpenRouter (each ~150-200 lines). Browser automation runs on Playwright with Chromium.
How to install
pip install -e .
playwright install chromium
Then export the key for your backend (OPENAI_API_KEY or ANTHROPIC_API_KEY) and run a task:
python -m webwright.run.cli \
-c base.yaml -c model_openai.yaml \
-t "Search for flights from SEA to JFK on 2026-08-15 to 2026-08-20" \
--start-url https://www.google.com/flights \
--task-id demo_openai \
-o outputs/default
Config files stack with -c. The image_qa and self_reflection tools reuse the configured model, so an Anthropic run needs no OpenAI key.
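One common way stacked config files behave is a left-to-right deep merge, where later files override earlier ones. Whether Webwright merges exactly this way is an assumption; the sketch below only illustrates the idea, and `merge_configs` and the keys in the example are hypothetical.

```python
from typing import Any


def merge_configs(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge config layers left to right; later layers override earlier ones.

    Nested dicts are merged key by key; any other value is replaced whole.
    The input dicts are not modified.
    """
    merged: dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = value
    return merged


if __name__ == "__main__":
    # Hypothetical contents standing in for base.yaml and model_openai.yaml.
    base = {"model": {"name": "placeholder", "temperature": 0.0}, "max_steps": 100}
    override = {"model": {"name": "other-model"}}
    print(merge_configs(base, override))
```

Under this reading, `-c base.yaml -c model_openai.yaml` means the model file only needs to state what differs from the base.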
Use as a Claude Code / Codex plugin
This is the part worth knowing. Webwright ships plugin manifests so an existing coding agent drives the loop natively, with no extra API key or cost beyond the host subscription. The same skills/webwright/ folder loads across Claude Code, Codex, OpenClaw, and Hermes.
/plugin marketplace add microsoft/Webwright
/plugin install webwright@webwright
Restart the session, then prompt in plain English or use slash commands:
- /webwright:run produces a one-shot final_script.py for the literal task values.
- /webwright:craft produces a reusable parameterized CLI tool, with an argparse wrapper so you can rerun it later with different arguments, e.g. python final_script.py --origin JFK --destination LAX --depart-date 2026-07-01.
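For a sense of what a parameterized script of that kind might look like, here is a minimal sketch. The three flags come from the example command above; everything else is an assumption, including the function names and the placeholder `search_flights`, which stands in for the browser automation the agent would actually have written.

```python
import argparse
from datetime import date
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Re-run a solved flight search with new values."
    )
    parser.add_argument("--origin", required=True, help="Departure airport code")
    parser.add_argument("--destination", required=True, help="Arrival airport code")
    # date.fromisoformat rejects malformed dates before any browser work starts.
    parser.add_argument(
        "--depart-date", required=True, type=date.fromisoformat, help="YYYY-MM-DD"
    )
    return parser


def search_flights(origin: str, destination: str, depart_date: date) -> str:
    # Placeholder: a generated script would drive the browser here.
    return f"{origin} -> {destination} on {depart_date.isoformat()}"


def main(argv: Optional[Sequence[str]] = None) -> str:
    args = build_parser().parse_args(argv)
    return search_flights(args.origin, args.destination, args.depart_date)


# Same values as the example command, passed explicitly so the sketch runs as-is.
print(main(["--origin", "JFK", "--destination", "LAX", "--depart-date", "2026-07-01"]))
```

Calling `main()` with no arguments makes argparse read the real command line instead, which is how the script would behave when invoked from a shell.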
Reported performance
Webwright reports state-of-the-art results on two live-website benchmarks at a 100-step budget.
| Benchmark | Score | Note |
|---|---|---|
| Online-Mind2Web (300 tasks) | 86.7% (GPT-5.4) | Highest among open-sourced harnesses in the AutoEval category |
| Odysseys (200 long-horizon) | 60.1% (GPT-5.4) | +15.6 points over prior SOTA (vision + persistent browser) |
| Small-model + tools | Qwen-3.5-9B | Completes tasks well once 5+ reusable CLI tools exist |
Claude Opus 4.7 reaches 84.7% on Online-Mind2Web and is stronger on the hard split (80.5%). Code-as-action beats coordinate prediction across every difficulty split.
Value
A minimal, fork-friendly starting point for browser agents instead of another heavyweight platform. The agent loop is one ~450-line file, the Playwright environment ~570 lines. The real payoff is reuse: once a task is solved, the script can be parameterized, exported as a CLI, and handed to a coding agent, so the workflow is reused instead of rediscovered from scratch. The plugin path makes it directly useful inside Claude Code or Codex with zero added cost.
The Microsoft Research write-up has the full evaluation detail.
