Automation

Browser Agent

Autonomous web automation built on a formal state machine, with vision-guided execution, circuit breakers for fault tolerance, and a Telegram-based human-in-the-loop for when it gets stuck.

Playwright · Python · Ollama · OpenAI · Anthropic · Telegram · SQLite

In Plain English

This is a robot that browses the web for you. You tell it what you want done, and it clicks buttons, fills forms, reads pages, and extracts data on its own. If it hits a CAPTCHA (one of those "prove you are human" puzzles) or gets confused, it sends you a message on Telegram (a messaging app similar to WhatsApp) with a screenshot and asks what to do next.

Problem

Traditional web scrapers and automation scripts are fundamentally fragile. They depend on CSS selectors, XPath expressions, and page structures that break the moment a website ships an update. Even "smart" selectors fail when a site redesigns its layout, changes its class naming convention, or wraps content in a new framework component. The result is a maintenance nightmare: scripts that worked yesterday throw errors today, and fixing them means reverse-engineering whatever the site changed.

Browser Agent takes a fundamentally different approach by treating web automation as a planning and perception problem rather than a scripting one. Instead of hard-coding paths through a page, it captures the page's accessibility tree and screenshot in parallel, then uses vision and language models to understand what it sees and decide what to do next. A formal state machine governs the entire execution flow with 13 distinct states (IDLE, PLANNING, EXECUTING, OBSERVING, VALIDATING, RECOVERING, RETRYING, REPLANNING, WAITING_HUMAN, PAUSED, CLEANING_UP, COMPLETED, FAILED) and rigorously validated transitions between them. This means the agent always knows exactly where it is in its execution lifecycle, and every transition is either permitted or blocked.

Beyond general web tasks, the system includes a specialized email intelligence module that can log into Gmail, extract messages by category (subscriptions, shipping, bills, travel), and generate structured Obsidian notes from them. It supports multiple user profiles, configurable extraction queries, and exports to markdown, folders, or CSV. And when the agent encounters something it cannot handle autonomously, such as a CAPTCHA, a login wall, or an ambiguous choice, it sends a screenshot and question to Telegram with inline buttons, waits for a human response, and resumes execution with that guidance.

Architecture

[Architecture diagram: the Browser Agent execution pipeline. Input (CLI / natural language, MCP server / API) feeds the formal state machine (13 states). Page capture runs in parallel: an accessibility snapshot via JS DOM traversal and role mapping, plus a PNG screenshot (auto-resized, base64), with snapshot trimming that prioritizes interactive elements under a token budget. A multi-model factory routes between Ollama (local, free), OpenAI (planning), and Anthropic (complex tasks) using a content router built on observer flags, keyword lists, and URL pattern matching. The resilience layer combines a circuit breaker (CLOSED / OPEN / HALF_OPEN; 5 failures open the circuit, 30 s recovery, 3 successes to close), a tiered cache (L1 memory + L2 SQLite), retries with exponential backoff, and a global breaker registry. Human-in-the-loop runs through a Telegram bot (screenshot + inline buttons; CAPTCHA, login, and stuck questions; 5-minute timeout; the agent pauses in WAITING_HUMAN until a response arrives). Security and isolation: PII redactor (11 types, regex), URL filter with domain allowlisting, credential vault encrypted at rest, sandboxed browser contexts per task, content isolation, and injection blocking. Email intelligence: Gmail extractor (Selenium + JS), categorizer (subscriptions, shipping, travel), Obsidian note builder, multi-user profiles, language packs (NO/EN), dry-run mode, JSON import/export, scheduled runs. An event bus provides decoupled pub/sub across all components, carrying lifecycle, planning, execution, action, and observation events with correlation IDs for request tracing.]

Features

Formal State Machine

13 states

The agent's execution is governed by a rigorous state machine with 13 distinct states and validated transitions between them. Every state change fires entry and exit hooks, emits events to the bus, and gets recorded in a transition history for debugging. Transition guards can block invalid moves (for example, requiring a plan before execution starts). The machine tracks whether the agent is running, paused, waiting for human input, or in a terminal state, and it exposes valid next transitions at any moment. This eliminates an entire class of bugs where agents get stuck in undefined states or skip critical validation steps.
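To make the mechanics concrete, here is a minimal sketch of a guarded state machine in Python. The state names mirror the ones described above, but the class, the transition table (shown only as a subset), and the guard API are illustrative assumptions, not the project's actual code.

```python
from enum import Enum

# The 13 states described above.
AgentState = Enum("AgentState", [
    "IDLE", "PLANNING", "EXECUTING", "OBSERVING", "VALIDATING", "RECOVERING",
    "RETRYING", "REPLANNING", "WAITING_HUMAN", "PAUSED", "CLEANING_UP",
    "COMPLETED", "FAILED",
])

# Allowed transitions (subset, for illustration only).
TRANSITIONS = {
    AgentState.IDLE: {AgentState.PLANNING},
    AgentState.PLANNING: {AgentState.EXECUTING, AgentState.FAILED},
    AgentState.EXECUTING: {AgentState.OBSERVING, AgentState.RECOVERING},
    AgentState.OBSERVING: {AgentState.VALIDATING},
    AgentState.VALIDATING: {AgentState.EXECUTING, AgentState.RETRYING,
                            AgentState.REPLANNING, AgentState.WAITING_HUMAN,
                            AgentState.CLEANING_UP},
    AgentState.CLEANING_UP: {AgentState.COMPLETED, AgentState.FAILED},
}

class StateMachine:
    def __init__(self):
        self.state = AgentState.IDLE
        self.history = []   # transition history kept for debugging
        self.guards = {}    # (src, dst) -> predicate that may block the move

    def add_guard(self, src, dst, predicate):
        self.guards[(src, dst)] = predicate

    def transition(self, dst):
        if dst not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state.name} -> {dst.name}")
        guard = self.guards.get((self.state, dst))
        if guard and not guard():
            raise ValueError(f"guard blocked {self.state.name} -> {dst.name}")
        self.history.append((self.state, dst))
        self.state = dst    # entry/exit hooks and event-bus emission would go here

# Example guard: execution cannot start until a plan exists.
machine = StateMachine()
plan = None
machine.add_guard(AgentState.PLANNING, AgentState.EXECUTING, lambda: plan is not None)
machine.transition(AgentState.PLANNING)
# machine.transition(AgentState.EXECUTING) would now raise until a plan is set.
```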

Vision-Guided Page Capture

parallel capture

Instead of relying on brittle CSS selectors, the agent captures the full page state through two channels simultaneously: a JavaScript-based accessibility tree extraction that maps interactive elements with their roles, labels, and ARIA attributes, and a PNG screenshot that gets auto-resized and base64-encoded for vision models. The snapshot trimmer enforces a token budget by prioritizing interactive elements (buttons, links, inputs, checkboxes) over static content, so the model always sees what it can interact with first. A page history system keeps the last three full states and ten summaries, with a 20% content-diff threshold for detecting meaningful page changes.
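A minimal sketch of the dual-channel capture using Playwright's async API is below. The JavaScript snippet and the element shape it returns are simplified assumptions; the real traverser maps far more roles and attributes.

```python
import asyncio
import base64

# Simplified stand-in for the accessibility extraction script: collect
# interactive elements with a role, a label, and a disabled flag.
SNAPSHOT_JS = """
() => [...document.querySelectorAll('button, a, input, select, textarea')]
    .map(el => ({
        role: el.getAttribute('role') || el.tagName.toLowerCase(),
        label: el.getAttribute('aria-label') || el.innerText.slice(0, 80),
        disabled: el.disabled === true,
    }))
"""

async def capture_page(page):
    """page is a playwright.async_api.Page; both captures run concurrently."""
    elements, png_bytes = await asyncio.gather(
        page.evaluate(SNAPSHOT_JS),                  # structural channel
        page.screenshot(type="png", full_page=False) # visual channel
    )
    return elements, base64.b64encode(png_bytes).decode("ascii")
```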

Circuit Breaker Resilience

3-state protection

Every external service call is wrapped in a circuit breaker that transitions through CLOSED (normal), OPEN (failing fast), and HALF_OPEN (testing recovery) states. After five consecutive failures, the circuit opens and rejects requests instantly for 30 seconds, preventing cascade failures when a backend is degraded. In half-open mode, three successful calls close the circuit again. A global registry manages breakers for each service (Ollama, browser, API), and all statistics (total calls, successes, failures, rejections, state changes) are tracked for monitoring. Combined with exponential backoff retries and LRU-evicting tiered caching (in-memory L1 plus SQLite L2), the system degrades gracefully rather than crashing.
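The following is a minimal circuit-breaker sketch matching the thresholds described above (5 failures open the circuit, 30 s recovery window, 3 successes to close); the class and attribute names are illustrative rather than the project's actual implementation.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.failures = 0
        self.successes = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"            # recovery window elapsed, probe again
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.successes = 0
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"             # trip: reject calls instantly
                self.opened_at = time.monotonic()
            raise
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures, self.successes = "CLOSED", 0, 0
        else:
            self.failures = 0                   # healthy call resets the counter
        return result
```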

Telegram Human-in-the-Loop

5-min timeout

When the agent encounters a CAPTCHA, login wall, ambiguous choice, or any situation it cannot resolve autonomously, it transitions to the WAITING_HUMAN state and sends a Telegram message with a screenshot, context, and inline keyboard buttons. The human can tap a button or type a custom response. The agent polls for responses via a background thread, and once the human answers, it resumes execution with that guidance. Questions are typed (CAPTCHA, LOGIN, AMBIGUOUS, BLOCKED, STUCK, CONFIRM) for appropriate formatting and icon selection. A five-minute timeout auto-cancels if no response arrives, and the message is updated to reflect the timeout.
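A rough sketch of the ask-a-human flow over the Telegram Bot API, using plain HTTP calls, is shown below. BOT_TOKEN, CHAT_ID, and the function name are placeholders, and the real agent polls in a background thread rather than blocking like this.

```python
import json
import time
import requests

BOT_TOKEN = "..."   # placeholder
CHAT_ID = "..."     # placeholder
API = f"https://api.telegram.org/bot{BOT_TOKEN}"

def ask_human(question: str, screenshot_png: bytes, options: list[str],
              timeout: float = 300.0):
    # Screenshot + question + inline keyboard buttons.
    keyboard = {"inline_keyboard": [[{"text": o, "callback_data": o}] for o in options]}
    requests.post(f"{API}/sendPhoto",
                  data={"chat_id": CHAT_ID, "caption": question,
                        "reply_markup": json.dumps(keyboard)},
                  files={"photo": ("page.png", screenshot_png)})

    deadline = time.monotonic() + timeout        # 5-minute timeout by default
    offset = None
    while time.monotonic() < deadline:
        resp = requests.get(f"{API}/getUpdates",
                            params={"timeout": 25, "offset": offset}).json()
        for update in resp.get("result", []):
            offset = update["update_id"] + 1
            if "callback_query" in update:
                return update["callback_query"]["data"]   # button tapped
            if "message" in update and "text" in update["message"]:
                return update["message"]["text"]          # typed free-form answer
    return None   # timed out: caller updates the message and cancels the question
```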

PII Redaction and Security

11 PII types

Before any text is logged, cached, or sent to an external model, it passes through a PII redactor that detects 11 types of sensitive information: emails, phone numbers, SSNs, credit cards, IP addresses, dates of birth, passwords, API keys (OpenAI, GitHub PAT formats), OAuth tokens (Google, JWT), and more. Each PII type uses compiled regex patterns with VERBOSE and IGNORECASE flags. The redactor supports three replacement styles (typed markers like [REDACTED_EMAIL], generic [REDACTED], or deterministic hash placeholders), custom patterns for domain-specific data, and a quick boolean check for screening. Content isolation ensures that data from one browsing session never leaks to another.
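A minimal sketch of typed regex redaction follows; the patterns shown cover only a few of the 11 PII types and are deliberately simplified, not the project's actual expressions.

```python
import re

# Simplified patterns for a handful of PII types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+", re.IGNORECASE),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "OPENAI_KEY": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
}

def redact(text: str, style: str = "typed") -> str:
    for name, pattern in PATTERNS.items():
        marker = f"[REDACTED_{name}]" if style == "typed" else "[REDACTED]"
        text = pattern.sub(marker, text)
    return text

def contains_pii(text: str) -> bool:
    # Quick boolean screen used before logging or caching.
    return any(p.search(text) for p in PATTERNS.values())
```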

How It Works

01

Task Definition and Planning

The user describes a goal in natural language through the CLI, MCP server, or API (for example, "find flights from Oslo to London next Friday"). The state machine transitions from IDLE to PLANNING, and the planner model generates a step-by-step browsing plan with expected outcomes for each step. The content router analyzes the goal text and target URL to select the appropriate model: standard models for normal content, specialized models for adult or sensitive sites. The plan is validated before execution begins, and if the planner cannot generate a viable plan, the state transitions to FAILED with a clear reason.
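An illustrative sketch of the content-routing decision is below; the keyword list, URL pattern, and model identifiers are placeholders, not the project's actual configuration.

```python
import re

SENSITIVE_KEYWORDS = {"nsfw", "adult"}                                # placeholder list
SENSITIVE_URL_PATTERNS = [re.compile(r"(^|\.)example-adult\.com$")]   # placeholder pattern

def route_model(goal: str, url: str) -> str:
    host = re.sub(r"^https?://", "", url).split("/")[0].lower()
    flagged = (
        any(k in goal.lower() for k in SENSITIVE_KEYWORDS)
        or any(p.search(host) for p in SENSITIVE_URL_PATTERNS)
    )
    # Sensitive content stays on a local Ollama model; normal planning goes
    # to a hosted planner model (names are illustrative).
    return "ollama:llava" if flagged else "openai:gpt-4o"
```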

02

Parallel Page Capture

For each step in the plan, the agent transitions to EXECUTING and then OBSERVING. Two async tasks fire in parallel: the JavaScript DOM traverser walks the page's element tree up to 8 levels deep, mapping every interactive element (buttons, links, inputs, checkboxes, comboboxes) with their roles, ARIA labels, values, and disabled states. Simultaneously, a screenshot is captured, resized to the configured width while preserving aspect ratio, and encoded to base64. The snapshot is trimmed to fit within the model's token budget by prioritizing interactive and landmark elements. This dual-channel capture gives the model both structural and visual understanding of the page.
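The trimming step can be sketched as a simple budget allocation where interactive elements are kept first and static content fills whatever remains. The four-characters-per-token estimate and the element shape are assumptions for illustration.

```python
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}

def estimate_tokens(element: dict) -> int:
    # Crude heuristic: roughly four characters per token.
    return max(1, len(str(element)) // 4)

def trim_snapshot(elements: list[dict], token_budget: int = 4000) -> list[dict]:
    interactive = [e for e in elements if e.get("role") in INTERACTIVE_ROLES]
    static = [e for e in elements if e.get("role") not in INTERACTIVE_ROLES]
    kept, used = [], 0
    for element in interactive + static:   # interactive elements claim the budget first
        cost = estimate_tokens(element)
        if used + cost > token_budget:
            continue
        kept.append(element)
        used += cost
    return kept
```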

03

Execution with Fault Tolerance

The executor performs the planned action (click, type, navigate, scroll) through Playwright, with every call wrapped in a circuit breaker. If the action fails, the agent transitions to RECOVERING, where it can retry the action, replan from the current state, or escalate to the human-in-the-loop. The tiered cache checks whether this exact page state and action combination has been seen before, potentially skipping expensive model calls entirely. If the circuit breaker trips open after repeated failures, requests are rejected instantly for 30 seconds, giving the backend time to recover before the half-open test phase begins.
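A minimal sketch of the two-tier cache lookup (an in-memory LRU in front of SQLite) is shown below; the table schema and key format are illustrative assumptions.

```python
import json
import sqlite3
from collections import OrderedDict

class TieredCache:
    def __init__(self, db_path: str = "cache.db", l1_size: int = 256):
        self.l1 = OrderedDict()     # L1: in-memory, LRU-evicted
        self.l1_size = l1_size
        self.db = sqlite3.connect(db_path)   # L2: persistent SQLite
        self.db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")

    def get(self, key: str):
        if key in self.l1:                    # L1 hit: refresh LRU position
            self.l1.move_to_end(key)
            return self.l1[key]
        row = self.db.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
        if row:                               # L2 hit: promote into L1
            value = json.loads(row[0])
            self.set(key, value, persist=False)
            return value
        return None

    def set(self, key: str, value, persist: bool = True):
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_size:       # evict least-recently-used entry
            self.l1.popitem(last=False)
        if persist:
            self.db.execute("INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                            (key, json.dumps(value)))
            self.db.commit()
```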

04

Validation and Recovery

After each action, the agent transitions to VALIDATING and captures a fresh page state. The validator checks whether the expected outcome was achieved by comparing the new page against the plan's expectations. If validation passes, the agent moves to the next step. If it fails, the state machine has multiple recovery paths: RETRYING the same action, REPLANNING from the current page state, or transitioning to WAITING_HUMAN if the situation requires human judgment. The page history system detects whether the page actually changed (using a 20% content-diff threshold) to avoid re-executing actions that already succeeded silently.
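The 20% content-diff check can be sketched as below; difflib here stands in for whatever diffing the project actually uses.

```python
from difflib import SequenceMatcher

def page_changed(previous_text: str, current_text: str, threshold: float = 0.20) -> bool:
    # Changed only if at least 20% of the content differs from the last capture.
    similarity = SequenceMatcher(None, previous_text, current_text).ratio()
    return (1.0 - similarity) >= threshold
```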

05

Result Extraction and Cleanup

Once all steps complete, the agent extracts structured data from the final page state. For email intelligence tasks, the extractor categorizes messages by type (subscriptions, shipping, travel, bills), generates structured Obsidian notes with frontmatter metadata, and exports results to markdown files, folders, or CSV. The event bus publishes completion events with timing data, and the state machine moves through CLEANING_UP to close browser contexts and release resources before settling in the terminal COMPLETED state. All PII in logs and cache entries is redacted before persistence.
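For the Obsidian export, a note is essentially markdown with YAML frontmatter. The sketch below shows the general shape; the field names are assumptions, not the project's actual note schema.

```python
from datetime import date

def build_note(subject: str, sender: str, category: str, body: str) -> str:
    frontmatter = "\n".join([
        "---",
        f"title: {subject}",
        f"from: {sender}",
        f"category: {category}",   # e.g. subscriptions, shipping, bills, travel
        f"captured: {date.today().isoformat()}",
        "---",
    ])
    return f"{frontmatter}\n\n{body.strip()}\n"

# Usage: write build_note(...) to a file inside the vault folder,
# e.g. vault/Email/<subject>.md
```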

Tech Stack

Browser

Playwright (primary), Selenium (Gmail extraction), with sandboxed contexts per task

Models

Ollama (local vision and text), OpenAI (planning), Anthropic (complex reasoning), via pluggable ModelFactory

State Management

Formal state machine with 13 states, validated transitions, entry/exit hooks, and transition history

Resilience

Circuit breakers (registry pattern), exponential backoff retries, tiered LRU cache (memory + SQLite)

Communication

Event bus (pub/sub with correlation IDs), Telegram bot (human-in-the-loop), MCP server interface

Security

PII redaction (11 types), URL filtering, credential vault, content isolation between tasks