Webwright: Terminal-Native Web Agents

Table of Contents

What it is for
How the loop works
Language and dependencies
How to install
Use as a Claude Code / Codex plugin
Reported performance
Value

Webwright is a Microsoft Research project that inverts the usual browser-agent design. Instead of keeping one browser session alive and predicting the next click, it hands the model a terminal and a local workspace, then lets it write Python code that launches, inspects, and discards browser sessions. The durable output is not a finished task but a re-runnable script: the agent's browsing history becomes a single code file.

microsoft/Webwright

Turn your coding models into state-of-the-art browser agents

https://github.com/microsoft/Webwright

What it is for

Long-horizon, real-website tasks: comparing flights, scraping listings, filling forms, checking inventory across many pages. The core idea is code-as-action. Most agents (browser-use, Stagehand) treat the live browser session as their state and select one indexed click or type action per step. Webwright treats the local workspace (code, screenshots, logs) as the state. The browser is just an environment the agent spawns and throws away.
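
To make the contrast concrete, here is a minimal sketch of a throwaway browser session in the code-as-action style. It uses Playwright's sync API; the function name, the workspace layout, and the file names page.png and log.txt are assumptions made for illustration, not Webwright's actual code.

```python
from pathlib import Path


def snapshot_page(url: str, workspace: Path) -> dict:
    """Launch a fresh browser, record what it sees in the workspace, discard it."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    workspace.mkdir(parents=True, exist_ok=True)
    shot = workspace / "page.png"  # hypothetical artifact name

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            page.screenshot(path=str(shot), full_page=True)
            title = page.title()
        finally:
            browser.close()  # the session is disposable; the files are the state

    # Hypothetical log format: one tab-separated line per visited page.
    (workspace / "log.txt").write_text(f"{url}\t{title}\n", encoding="utf-8")
    return {"url": url, "title": title, "screenshot": str(shot)}
```

Nothing about the browser survives the call. What persists is the screenshot and the log line in the workspace, which is the state a later step would read.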

Why this matters: as models get better at writing and debugging code, the old one-action-at-a-time harness becomes the bottleneck. Date pickers, pagination, and filtering collapse into loops and functions instead of long chains of fragile pixel-level actions. Fewer rounds, less error accumulation on long tasks.
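
As a rough illustration of pagination and filtering collapsing into loops and functions, the sketch below walks result pages until one comes back empty and then ranks the rows in plain Python. The fetch_page callable is a hypothetical stand-in for a Playwright-backed scrape of one results page, and the price field is likewise assumed.

```python
from typing import Callable, Iterator


def iter_listings(
    fetch_page: Callable[[int], list],
    max_pages: int = 50,
) -> Iterator[dict]:
    """Yield listings page by page until a page comes back empty."""
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)
        if not rows:  # an empty page means we ran past the last one
            return
        yield from rows


def cheapest(listings: Iterator[dict], limit: int = 3) -> list:
    """Filter and rank in code rather than through a chain of UI clicks."""
    return sorted(listings, key=lambda row: row["price"])[:limit]
```

With a real scraper plugged in as fetch_page, a call such as cheapest(iter_listings(fetch_page)) replaces what would otherwise be a chain of "next page" clicks, each one a separate model round.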

How the loop works

Deliberately small: one runner, one model endpoint, one terminal environment. Roughly 1.5K lines total, no multi-agent orchestration, no graph engine, no plugin layer.

Send context: the runner passes the task, workspace state, and recent observations to the model.
Emit bash: the model returns a thinking block plus a shell command, usually a Playwright-backed Python script.
Return observations: the environment runs the command and returns terminal output, logs, screenshots, or error tracebacks.
Refine and finish: the loop continues until the agent writes a final script, reruns it in a fresh folder, and passes a self-reflection check.
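
The four steps above can be sketched as a single small loop. This is an illustrative reduction, not Webwright's runner: the Step and Runner names, the model callable signature, and the five-observation window are all assumptions, and the finish condition is simplified to the model returning no command.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    thinking: str
    command: Optional[str]  # None signals that the model considers the task done


@dataclass
class Runner:
    model: Callable[[str, List[str]], Step]  # hypothetical model interface
    workdir: str = "."
    max_steps: int = 100
    observations: List[str] = field(default_factory=list)

    def run(self, task: str) -> List[str]:
        for _ in range(self.max_steps):
            # Send context: the task plus the most recent observations.
            step = self.model(task, self.observations[-5:])
            if step.command is None:
                break
            # Emit bash: run the model's shell command inside the workspace.
            result = subprocess.run(
                step.command, shell=True, cwd=self.workdir,
                capture_output=True, text=True, timeout=300,
            )
            # Return observations: exit code, stdout, and stderr (tracebacks).
            self.observations.append(
                f"$ {step.command}\nexit={result.returncode}\n"
                f"{result.stdout}{result.stderr}"
            )
        return self.observations
```

A failed command is not an exception here. Its traceback simply becomes the next observation, which is what lets the model debug its own script on the following round.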

The "done" gate is strict: the agent must produce a final_script.py, rerun it clean, save logs and screenshots, and pass a visual self-reflection judgement before completion is accepted.
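
A minimal sketch of the mechanical part of such a gate follows, assuming the rerun should leave at least one .log file and one .png screenshot behind. Those artifact conventions and the passes_done_gate name are hypothetical, and the visual self-reflection judgement, which needs a model call, is deliberately left out.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_done_gate(workspace: Path, timeout: int = 600) -> bool:
    """Rerun final_script.py in a fresh folder and check it leaves artifacts."""
    script = workspace / "final_script.py"
    if not script.is_file():
        return False

    with tempfile.TemporaryDirectory() as fresh:
        fresh_dir = Path(fresh)
        # Copy only the script, so the rerun cannot lean on leftover state.
        shutil.copy(script, fresh_dir / script.name)
        result = subprocess.run(
            [sys.executable, script.name],
            cwd=fresh_dir, capture_output=True, text=True, timeout=timeout,
        )
        if result.returncode != 0:
            return False
        # Assumed conventions: logs end in .log, screenshots in .png.
        has_logs = any(fresh_dir.rglob("*.log"))
        has_shots = any(fresh_dir.rglob("*.png"))
        return has_logs and has_shots
```

Running the script from an empty directory is the point of the check: a script that only works because of files left over from exploration fails here rather than later.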

Language and dependencies

Python 3.10+. Almost no framework weight. The only real dependencies are httpx, pydantic, playwright, and typer. Pluggable model backends for OpenAI, Anthropic, and OpenRouter (each ~150-200 lines). Browser automation runs on Playwright with Chromium.

How to install

pip install -e .
playwright install chromium

Then export the key for your backend (OPENAI_API_KEY or ANTHROPIC_API_KEY) and run a task:

python -m webwright.run.cli \
    -c base.yaml -c model_openai.yaml \
    -t "Search for flights from SEA to JFK on 2026-08-15 to 2026-08-20" \
    --start-url https://www.google.com/flights \
    --task-id demo_openai \
    -o outputs/default

Config files stack with -c. The image_qa and self_reflection tools reuse the configured model, so an Anthropic run needs no OpenAI key.
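
One plausible reading of stacking is a recursive merge in which later files override earlier ones; the source does not spell out the exact semantics, so treat the sketch below as an assumption. The dictionaries stand in for the parsed contents of base.yaml and model_openai.yaml, and their keys and values are invented for illustration.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_configs(*layers: Dict[str, Any]) -> Dict[str, Any]:
    """Merge config layers left to right; later layers override earlier ones."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        _merge_into(merged, layer)
    return merged


def _merge_into(target: Dict[str, Any], source: Dict[str, Any]) -> None:
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _merge_into(target[key], value)  # recurse into nested sections
        else:
            target[key] = deepcopy(value)  # scalars and lists are replaced


# Hypothetical parsed contents of the two files passed with -c.
base = {"agent": {"max_steps": 100}, "model": {"provider": None}}
model_openai = {"model": {"provider": "openai"}}
config = merge_configs(base, model_openai)
```

Under this reading, swapping the second -c file for a different model config would change only the model section while the base settings carry over untouched.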

Use as a Claude Code / Codex plugin

This is the part worth knowing. Webwright ships plugin manifests so an existing coding agent can drive the loop natively, with no extra API key or cost beyond the host subscription. The same skills/webwright/ folder loads across Claude Code, Codex, OpenClaw, and Hermes.

/plugin marketplace add microsoft/Webwright
/plugin install webwright@webwright

Restart the session, then prompt in plain English or use slash commands:

/webwright:run produces a one-shot final_script.py for the literal task values.
/webwright:craft produces a reusable parameterized CLI tool, with an argparse wrapper so you can rerun it later with different arguments, e.g. python final_script.py --origin JFK --destination LAX --depart-date 2026-07-01.
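
For a sense of what such a wrapper might look like, here is a minimal argparse sketch built around the three flags in the example command. The function names and the placeholder search_flights body are assumptions; a generated script would contain the agent's actual Playwright logic instead.

```python
import argparse
from datetime import date
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Search flights.")
    parser.add_argument("--origin", required=True, help="Airport code, e.g. JFK")
    parser.add_argument("--destination", required=True, help="Airport code, e.g. LAX")
    parser.add_argument(
        "--depart-date", required=True, type=date.fromisoformat,
        help="Departure date as YYYY-MM-DD",
    )
    return parser


def search_flights(origin: str, destination: str, depart_date: date) -> None:
    # Placeholder for the browser automation the agent would have written.
    print(f"Searching {origin} -> {destination} on {depart_date.isoformat()}")


def main(argv: Optional[List[str]] = None) -> None:
    args = build_parser().parse_args(argv)
    search_flights(args.origin, args.destination, args.depart_date)
```

In a standalone script, main() would be called under the usual if __name__ == "__main__": guard. Parsing the date with date.fromisoformat means a malformed value is rejected by argparse before any browser is launched.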

Reported performance

State-of-the-art on two live-website benchmarks at a 100-step budget.

Benchmark or setting	Score (model)	Note
Online-Mind2Web (300 tasks)	86.7% (GPT-5.4)	Highest among open-sourced harnesses in the AutoEval category
Odysseys (200 long-horizon)	60.1% (GPT-5.4)	+15.6 points over prior SOTA (vision + persistent browser)
Small-model + tools	Qwen-3.5-9B	Completes tasks well once 5+ reusable CLI tools exist

Claude Opus 4.7 reaches 84.7% on Online-Mind2Web and is stronger on the hard split (80.5%). Code-as-action beats coordinate prediction across every difficulty split.

Value

A minimal, fork-friendly starting point for browser agents instead of another heavyweight platform. The agent loop is one ~450-line file, the Playwright environment ~570 lines. The real payoff is reuse: once a task is solved, the script can be parameterized, exported as a CLI, and handed to a coding agent, so the workflow is reused instead of rediscovered from scratch. The plugin path makes it directly useful inside Claude Code or Codex with zero added cost.

The Microsoft Research write-up has the full evaluation detail.

Webwright: A Terminal Is All You Need For Web Agents

Microsoft Research blog and project page

https://www.microsoft.com/en-us/research/articles/webwright-a-terminal-is-all-you-need-for-web-agents/