Google's Gemini 2.5 Computer Use: Human-Like Web Agents

Google launches Gemini 2.5 Computer Use: a browser-focused AI agent that reads screenshots and performs human-like UI actions. Available via Gemini API on Google AI Studio and Vertex AI with safety reviews and demo access.

Chloe Nakamura

Google has unveiled Gemini 2.5 Computer Use, a new AI model that mimics human interaction with websites and web apps. Now in public preview through the Gemini API on Google AI Studio and Vertex AI, the model is built to automate real-world browser tasks with lower latency and stronger visual reasoning.

What this model actually does (and why it matters)

Gemini 2.5 Computer Use extends Gemini 2.5 Pro’s visual understanding to perform hands-on browser actions: clicking, typing, scrolling, hovering, opening dropdowns and navigating to URLs. Rather than calling web APIs, the agent analyzes screenshots of the page and returns precise UI actions that drive the interface, in effect using the web the way a person would.

How it works: screenshots, action loops, and client execution

The model receives three inputs: a task prompt, a screenshot of the current UI, and a short history of recent actions. It then interprets the visual layout and suggests a single UI action (for example, click this button or enter text into that field). That action is executed on the client side and a fresh screenshot is sent back to the model so the loop can continue until the task is done.
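The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Google's client code: `Action`, `run_agent_loop`, `make_scripted_model` and the `history_window` parameter are hypothetical names, and the model is replaced by a scripted stand-in so the example runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Action:
    """Hypothetical UI action; the real action schema is not shown in this article."""
    kind: str          # e.g. "click", "type", or "done" when the task is finished
    target: str = ""   # human-readable description of the UI element
    text: str = ""     # text to enter, for "type" actions


def run_agent_loop(
    task: str,
    propose_action: Callable[[str, bytes, Sequence[Action]], Action],
    execute: Callable[[Action], None],
    take_screenshot: Callable[[], bytes],
    max_steps: int = 20,
    history_window: int = 5,
) -> List[Action]:
    """Screenshot -> one proposed action -> client-side execution -> new screenshot."""
    history: List[Action] = []
    screenshot = take_screenshot()
    for _ in range(max_steps):
        # The model sees the task, the current screenshot, and recent actions only.
        action = propose_action(task, screenshot, history[-history_window:])
        if action.kind == "done":
            break
        execute(action)                 # executed by the client, not by the model
        history.append(action)
        screenshot = take_screenshot()  # fresh screenshot for the next turn
    return history


def make_scripted_model(script: List[Action]):
    """Stand-in for the model: replays a fixed script, then reports "done"."""
    steps = iter(script)

    def propose(task: str, screenshot: bytes, recent: Sequence[Action]) -> Action:
        return next(steps, Action("done"))

    return propose


executed: List[Action] = []
history = run_agent_loop(
    task="Sign up for the newsletter",
    propose_action=make_scripted_model([
        Action("type", target="email field", text="user@example.com"),
        Action("click", target="Subscribe button"),
    ]),
    execute=executed.append,                   # a real client would drive a browser here
    take_screenshot=lambda: b"fake-png-bytes",  # a real client would capture the page
)
print([a.kind for a in history])  # ['type', 'click']
```

The `max_steps` cap is a design choice of this sketch: it keeps a confused agent from looping forever when the task never reaches a finished state.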

Benchmarks, demos, and what the videos show

Google says Gemini 2.5 Computer Use outperforms competing tools on benchmarks such as Online-Mind2Web, WebVoyager and AndroidWorld while keeping latency low. Demo clips, sped up to show the flows quickly, illustrate tasks such as reorganizing sticky notes on a digital whiteboard and transferring pet records from a website into a CRM. Those examples show how the agent chains simple UI steps into a complex workflow.

Capabilities, limits, and platform fit

The model currently supports 13 distinct UI actions and performs best in web browsers. Google cautions it isn’t fully optimized for desktop OS-level automation yet, though preliminary mobile benchmark results are promising. Internal teams already use it for UI testing and automation across services like Search and Firebase.

Safety-first design and developer controls

To reduce misuse, every proposed action is vetted by a safety service before execution. Developers can disable specific actions or require explicit user confirmation for sensitive steps, so financial transactions and other high-risk operations can be gated behind extra checks. Early-access external developers have used the model for workflow automation, assistant tools and CI-style UI testing.
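On the client side, developer controls of this kind might look like the following sketch. It is an assumption-laden illustration rather than the actual Gemini safety service or its configuration: `Policy`, `review_action`, the keyword list and the returned status strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Action:
    """Hypothetical UI action proposed by the model."""
    kind: str
    target: str = ""


@dataclass(frozen=True)
class Policy:
    """Hypothetical developer policy: actions to disable and steps to confirm."""
    disabled_kinds: FrozenSet[str] = frozenset()
    sensitive_keywords: FrozenSet[str] = frozenset({"pay", "purchase", "transfer"})


def review_action(action: Action, policy: Policy,
                  confirm: Callable[[Action], bool]) -> str:
    """Return "blocked", "declined", or "allowed" for a proposed action."""
    if action.kind in policy.disabled_kinds:
        return "blocked"                      # developer disabled this action type
    target = action.target.lower()
    if any(word in target for word in policy.sensitive_keywords):
        # Sensitive step: run it only if the user explicitly confirms.
        return "allowed" if confirm(action) else "declined"
    return "allowed"


policy = Policy(disabled_kinds=frozenset({"navigate"}))
user_says_no = lambda action: False           # stand-in for a confirmation dialog

print(review_action(Action("navigate", "https://example.com"), policy, user_says_no))  # blocked
print(review_action(Action("click", "Pay now button"), policy, user_says_no))          # declined
print(review_action(Action("click", "Search box"), policy, user_says_no))              # allowed
```

Keyword matching on the target description is only a placeholder for a real risk check; the point is that the gate sits between the model's proposal and the client's execution.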

How to try it today

  • Access via the Gemini API in Google AI Studio or Vertex AI.
  • Experiment with a Browserbase demo environment Google provides for testing.
  • Join early access programs to build assistants or automation tools that leverage on-screen reasoning.

Who should pay attention

Product teams building browser-based assistants, QA engineers seeking smarter UI tests, and developers automating repetitive web workflows will find Gemini 2.5 Computer Use particularly useful. If your application needs human-like interaction across complex web interfaces, this model is worth exploring.

Source: gizmochina
