How Do I Get My AI Agent to Use the Browser?
It's one of the most common questions from developers building AI agents: how do I get my agent to actually open a browser, click things, fill out forms, and interact with the web like a human?
The short answer is: you build browser tools and expose them to your agent. The longer answer involves understanding the infrastructure, writing the tooling, and deciding how much of that you actually want to manage yourself.
Let's go from zero to a fully browser-capable agent — and then show you how Gyld lets you skip most of it.
What It Actually Takes to Give an Agent a Browser
AI models can't operate a browser on their own. They can't just "open Chrome" and start clicking around. What they can do is call tools: functions you define and expose to the model, each of which describes an action, accepts parameters, and returns a result.
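To make the idea of a tool concrete, here is a minimal sketch of one as a plain TypeScript object: metadata the model reads, plus a handler you run. The `Tool` interface, the `echoTool` example, and the `missingParams` helper are hypothetical names used for illustration; they are not part of any SDK.

```typescript
// Hypothetical shape for a tool: a description for the model plus a handler you execute.
interface Tool {
  name: string;
  description: string;
  inputSchema: {
    type: 'object';
    properties: Record<string, { type: string; description?: string }>;
    required: string[];
  };
  handler: (input: Record<string, unknown>) => Promise<string>;
}

// Illustrative tool that simply echoes its input back.
const echoTool: Tool = {
  name: 'echo',
  description: 'Return the text that was passed in',
  inputSchema: {
    type: 'object',
    properties: { text: { type: 'string', description: 'Text to echo' } },
    required: ['text'],
  },
  handler: async (input) => `You said: ${String(input.text)}`,
};

// List the required parameters that are absent from a tool call's input.
function missingParams(tool: Tool, input: Record<string, unknown>): string[] {
  return tool.inputSchema.required.filter((key) => !(key in input));
}
```

Checking for missing parameters before running the handler lets you return a readable error to the model instead of throwing inside your own code.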
To give an agent browser access, you need to:
- Spin up a browser instance (Playwright, Puppeteer, Selenium, etc.)
- Wrap browser actions as callable tools (navigate, click, type, screenshot, read DOM, etc.)
- Feed those tools to your model via the API's tool-use interface
- Handle the tool call loop — the model calls a tool, you execute it, you return the result, the model decides what to do next
- Manage state — the browser session needs to persist across multiple tool calls within a single task
None of this is magic, but it's a non-trivial amount of plumbing to get right.
Building Browser Tools with Playwright
Playwright is the go-to library for this. It's fast, reliable, and supports Chromium, Firefox, and WebKit. Here's what a basic browser tool implementation looks like.
Step 1: Initialize the Browser
import { chromium, Browser, Page } from 'playwright';

// Module-level singletons so the same session persists across tool calls
let browser: Browser | null = null;
let page: Page | null = null;

async function getBrowser(): Promise<Page> {
  if (!browser) {
    browser = await chromium.launch({
      headless: false, // Set to true for background operation
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    });
  }
  if (!page) {
    const context = await browser.newContext({
      viewport: { width: 1280, height: 800 },
      userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    });
    page = await context.newPage();
  }
  return page;
}
Step 2: Define Your Tool Functions
Each browser capability becomes its own function. These are the actual implementations that get called when the model requests a tool use.
async function navigateTo(url: string): Promise<string> {
  const page = await getBrowser();
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
  const title = await page.title();
  return `Navigated to ${url}. Page title: "${title}"`;
}

async function clickElement(selector: string): Promise<string> {
  const page = await getBrowser();
  try {
    await page.waitForSelector(selector, { timeout: 5000 });
    await page.click(selector);
    return `Clicked element: ${selector}`;
  } catch (err) {
    // err is `unknown` under strict TypeScript, so narrow it before reading .message
    const message = err instanceof Error ? err.message : String(err);
    return `Failed to click ${selector}: ${message}`;
  }
}

async function typeText(selector: string, text: string): Promise<string> {
  const page = await getBrowser();
  await page.waitForSelector(selector, { timeout: 5000 });
  await page.fill(selector, text);
  return `Typed "${text}" into ${selector}`;
}

async function getPageContent(): Promise<string> {
  const page = await getBrowser();
  // innerText already skips <script> and <style> content, so read it without
  // removing nodes from the live page (deleting styles would break the layout)
  return page.evaluate(() => document.body.innerText.substring(0, 8000)); // Truncate for context window
}

async function takeScreenshot(): Promise<string> {
  const page = await getBrowser();
  const buffer = await page.screenshot({ type: 'png' });
  return buffer.toString('base64');
}

async function waitForElement(selector: string, timeout = 10000): Promise<string> {
  const page = await getBrowser();
  try {
    await page.waitForSelector(selector, { timeout });
    return `Element ${selector} is visible`;
  } catch {
    return `Element ${selector} did not appear within ${timeout}ms`;
  }
}

async function scrollPage(direction: 'up' | 'down', amount = 500): Promise<string> {
  const page = await getBrowser();
  // Playwright's evaluate accepts a single argument, so pass both values as one object
  await page.evaluate(({ dir, px }) => {
    window.scrollBy(0, dir === 'down' ? px : -px);
  }, { dir: direction, px: amount });
  return `Scrolled ${direction} by ${amount}px`;
}

async function executeScript(script: string): Promise<string> {
  const page = await getBrowser();
  const result = await page.evaluate(script);
  // JSON.stringify(undefined) returns undefined, so fall back to a string
  return JSON.stringify(result) ?? 'undefined';
}
Step 3: Define Tool Schemas for the Model
This is where most of the work lives. Every tool needs a JSON schema definition so the model knows what it can call and what parameters to pass.
const browserTools = [
  {
    name: "navigate_to",
    description: "Navigate the browser to a specific URL",
    input_schema: {
      type: "object",
      properties: {
        url: { type: "string", description: "The full URL to navigate to, including https://" }
      },
      required: ["url"]
    }
  },
  {
    name: "click_element",
    description: "Click an element on the page using a CSS selector or text content",
    input_schema: {
      type: "object",
      properties: {
        selector: {
          type: "string",
          description: "CSS selector, XPath, or text selector (e.g. 'button.submit', 'text=Submit', '#login-btn')"
        }
      },
      required: ["selector"]
    }
  },
  {
    name: "type_text",
    description: "Type text into an input field or textarea",
    input_schema: {
      type: "object",
      properties: {
        selector: { type: "string", description: "CSS selector for the input field" },
        text: { type: "string", description: "The text to type into the field" }
      },
      required: ["selector", "text"]
    }
  },
  {
    name: "get_page_content",
    description: "Read the visible text content of the current page",
    input_schema: { type: "object", properties: {}, required: [] }
  },
  {
    name: "take_screenshot",
    description: "Take a screenshot of the current browser state and return it as base64",
    input_schema: { type: "object", properties: {}, required: [] }
  },
  {
    name: "wait_for_element",
    description: "Wait for an element to appear on the page before proceeding",
    input_schema: {
      type: "object",
      properties: {
        selector: { type: "string" },
        timeout: { type: "number", description: "Maximum milliseconds to wait (default: 10000)" }
      },
      required: ["selector"]
    }
  },
  {
    name: "scroll_page",
    description: "Scroll the page up or down",
    input_schema: {
      type: "object",
      properties: {
        direction: { type: "string", enum: ["up", "down"] },
        amount: { type: "number", description: "Pixels to scroll (default: 500)" }
      },
      required: ["direction"]
    }
  }
];
Step 4: Build the Tool Call Loop
This is the orchestration layer — the loop that keeps running until the model decides it's done.
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const MAX_STEPS = 25; // Safety cap (arbitrary value) so a confused model cannot loop forever

async function runBrowserAgent(task: string): Promise<string> {
  const messages: any[] = [{ role: 'user', content: task }];

  for (let step = 0; step < MAX_STEPS; step++) {
    const response = await client.messages.create({
      model: 'claude-opus-4-5',
      max_tokens: 4096,
      tools: browserTools,
      messages,
      system: `You are a browser automation agent. Use the browser tools to complete the task.
Take screenshots when you need to understand the current page state.
Be methodical: navigate, read content, then interact with elements.`
    });

    // Add assistant response to history
    messages.push({ role: 'assistant', content: response.content });

    // Any stop reason other than a tool request (end_turn, max_tokens, ...) ends the loop;
    // continuing without tool results would send an empty user message
    if (response.stop_reason !== 'tool_use') {
      const textBlock = response.content.find((b: any) => b.type === 'text') as any;
      return textBlock?.text || 'Task completed';
    }

    // Process tool calls
    const toolResults: any[] = [];
    for (const block of response.content) {
      if (block.type !== 'tool_use') continue;
      const input = block.input as any; // Tool input arrives untyped
      let content: any;
      switch (block.name) {
        case 'navigate_to':
          content = await navigateTo(input.url);
          break;
        case 'click_element':
          content = await clickElement(input.selector);
          break;
        case 'type_text':
          content = await typeText(input.selector, input.text);
          break;
        case 'get_page_content':
          content = await getPageContent();
          break;
        case 'take_screenshot':
          // Send the PNG as an image block so the model sees the page instead of reading base64 text
          content = [{
            type: 'image',
            source: { type: 'base64', media_type: 'image/png', data: await takeScreenshot() }
          }];
          break;
        case 'wait_for_element':
          content = await waitForElement(input.selector, input.timeout);
          break;
        case 'scroll_page':
          content = await scrollPage(input.direction, input.amount);
          break;
        default:
          content = `Unknown tool: ${block.name}`;
      }
      toolResults.push({ type: 'tool_result', tool_use_id: block.id, content });
    }

    // Feed results back to the model
    messages.push({ role: 'user', content: toolResults });
  }

  return `Stopped after ${MAX_STEPS} steps without finishing the task`;
}
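As the tool set grows, a lookup table can be easier to maintain than a long switch statement. The sketch below shows one way to do that, with failures converted into strings the model can read. The `handlers` map and `dispatchTool` function are hypothetical, and the stub handlers stand in for the Playwright-backed functions from Step 2.

```typescript
type ToolHandler = (input: Record<string, unknown>) => Promise<string>;

// Stub handlers for illustration; in practice these would call navigateTo, clickElement, etc.
const handlers = new Map<string, ToolHandler>([
  ['navigate_to', async (input) => `Navigated to ${String(input.url)}`],
  ['get_page_content', async () => 'Example page text'],
  // This stub always throws, to show how a failure is reported back to the model
  ['click_element', async () => { throw new Error('element not found'); }],
]);

// Look up the handler by name and turn any thrown error into a readable result.
async function dispatchTool(name: string, input: Record<string, unknown>): Promise<string> {
  const handler = handlers.get(name);
  if (!handler) return `Unknown tool: ${name}`;
  try {
    return await handler(input);
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return `Tool ${name} failed: ${message}`;
  }
}
```

Returning the error text rather than throwing keeps the loop alive and gives the model a chance to try a different selector or approach.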
Step 5: Use It
const result = await runBrowserAgent(
  'Go to linkedin.com/in/johndoe, find his current job title and company, and return both'
);
console.log(result);
That's a functional browser agent. But look at everything you just had to build.
What You're Actually Managing
Once you have the above working, the real work begins:
Error handling at scale. Pages time out. Selectors change. Modals appear unexpectedly. Captchas block progress. Your agent needs graceful degradation for every failure mode, and those failure modes are endless.
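One common way to absorb transient failures such as timeouts is a retry wrapper with exponential backoff. The sketch below is illustrative only: `withRetry` and its default values are assumptions, not part of Playwright or any SDK, and it will not help with failures that need a different strategy, such as captchas.

```typescript
// Retry an async action, doubling the wait between attempts; rethrow the last error if all fail.
async function withRetry<T>(
  action: () => Promise<T>,
  attempts = 3,       // Assumed default
  baseDelayMs = 250   // Assumed default
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await action();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one
      if (i < attempts - 1) {
        await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

A tool function could wrap its navigation call in this helper, for example `withRetry(() => navigateTo(url))`, while leaving non-idempotent actions like form submission unwrapped.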
Screenshot → vision loop. For complex pages, you often need to pass screenshots back to a vision-capable model to understand the current state before deciding on the next selector. This adds latency and cost per step.
Session management. One task might require multiple pages, authenticated sessions, cookies that need to persist across steps, or even multiple browser tabs.
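For authenticated sessions, Playwright can save cookies and local storage to a file with `context.storageState({ path })` and load them again through the `storageState` option of `browser.newContext`. The sketch below wraps those two calls. To stay self-contained it declares minimal `BrowserLike` and `ContextLike` interfaces, which are hypothetical stand-ins that Playwright's own `Browser` and `BrowserContext` objects should satisfy; `openSession` and `saveSession` are likewise illustrative names.

```typescript
// Minimal structural types so the sketch compiles without importing Playwright.
interface ContextLike {
  storageState(options: { path: string }): Promise<unknown>;
}
interface BrowserLike {
  newContext(options?: { storageState?: string }): Promise<ContextLike>;
}

// Open a context, restoring cookies and local storage when a saved state file is given.
async function openSession(browser: BrowserLike, savedStatePath?: string): Promise<ContextLike> {
  return browser.newContext(savedStatePath ? { storageState: savedStatePath } : {});
}

// Persist the current cookies and local storage so a later task can reuse the login.
async function saveSession(context: ContextLike, path: string): Promise<void> {
  await context.storageState({ path });
}
```

Saving state after a successful login and restoring it at the start of the next task avoids repeating the login flow on every run.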
Context window management. Page content can be enormous. You need smart truncation, content extraction, and summarization strategies to avoid blowing through your context window on a single get_page_content call.
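A simple improvement over cutting text off at a fixed length is to keep the start and the end of the page and mark what was dropped, since footers and result summaries often sit at the bottom. The sketch below shows that idea. `truncateMiddle` is a hypothetical helper, and the 70/30 split is an arbitrary assumption to tune for your pages.

```typescript
// Keep the head and tail of long text, replacing the middle with a marker.
function truncateMiddle(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const omitted = text.length - maxChars;
  const head = Math.ceil((maxChars * 7) / 10); // Roughly 70% of the budget goes to the start
  const tail = maxChars - head;                // The rest goes to the end
  const marker = `\n[... ${omitted} characters omitted ...]\n`;
  return text.slice(0, head) + marker + text.slice(text.length - tail);
}
```

The marker tells the model that content is missing, so it can scroll or request a narrower extraction instead of assuming it has seen the whole page.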
Concurrency and queueing. If multiple tasks need the browser simultaneously, you need a queue. If tasks are long-running, you need background job infrastructure.
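When a single browser is shared, the simplest form of queueing is to run tasks strictly one at a time. The sketch below does that by chaining promises. `SerialQueue` is a hypothetical in-memory helper; it does not survive restarts, so long-running work would still need real background job infrastructure.

```typescript
// Run async tasks one at a time, in the order they were enqueued.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after everything queued before it has settled
    const run = this.tail.then(() => task());
    // Swallow failures on the internal chain so one failed task does not block the next
    this.tail = run.catch(() => undefined);
    return run;
  }
}
```

Each caller still receives its own result or error from `enqueue`, so a call such as `queue.enqueue(() => runBrowserAgent(task))` behaves like a normal awaited call that simply waits its turn.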
Deployment. Where does this run? On a server, a headed browser (headless: false, as in Step 1) needs a real display or a virtual framebuffer such as Xvfb. In the cloud, you're fighting bot detection the whole way (see our previous post on why local execution wins for authenticated browsing).
Building this well is a multi-week engineering project. And that's before you've written a single line of your actual product.
The Gyld Approach: Flip a Switch
Gyld is built on the idea that you shouldn't have to be a Playwright expert to give your AI employees browser access.
When you create an agent in Gyld, browser capability is a toggle — not a software project.
Instead of writing tool schemas, managing browser state, building the tool call loop, handling errors, and figuring out deployment, you describe what you need in plain language:
"I need my agent to check our competitors' pricing pages once a week and update a spreadsheet with any changes."
"When a lead submits our form, have the agent look up their LinkedIn profile and enrich the contact record with their current title and company size."
"Monitor our Google Business listing every morning and flag any new reviews so I can respond within the hour."
Gyld handles the browser session, the tool orchestration, the error recovery, and the results — you get the output.
The same agent that handles your emails through Gmail, your invoices through QuickBooks, and your customer records through Salesforce can also open a browser and interact with any website. It's one unified agent, not a collection of disconnected scripts.
Who Should Build vs. Who Should Buy
Build your own browser tools if:
- You need deep customization of the browser environment (custom fingerprinting, specific proxy configurations, complex multi-session workflows)
- You're building a product where browser automation is the core feature you're selling
- You have the engineering resources to maintain it, including when the sites you target change frequently
Use Gyld if:
- Browser automation is a means to an end — you need data or actions from the web, not a browser automation platform
- You want your non-technical team members to be able to spin up browser-capable agents without writing code
- You're already using Gyld agents for other workflows and want to add web interaction without building a separate system
The code above is a workable starting point, and if you need full control it's a reasonable foundation to build on. But if your goal is to get an agent browsing the web and doing useful work by tomorrow, Gyld is the faster path.
Get Started
If you want to dig into building your own Playwright-based browser tools, the full code from this post is a working foundation — add error handling, expand the tool set, and wire it into your agent framework of choice.
If you want a browser-capable AI employee without the infrastructure overhead, try Gyld and see how fast you can go from idea to automated workflow.
Gyld gives small businesses AI employees that actually get work done — connected to the tools you already use, with browser access built in.