Google Builds Computer Use Into Gemini 3.5 Flash as a Native Tool

Gemini 3.5 Flash computer use folds screen control into one cheap model with prompt-injection safeguards, though it is a preview and the benchmark is self-reported.

By Daniel Mercer. Edited by Maria Konash.

Google has turned computer use into a built-in tool inside Gemini 3.5 Flash, letting developers build AI agents that can see a screen and click, type, scroll and navigate across browsers, mobile devices and desktops.

The capability was previously available only through a separate Gemini 2.5 computer use model launched in October 2025. Now it sits alongside Flash’s other native tools, such as function calling, Google Search grounding and Maps, so a single agent can look something up, call a function and operate an interface without routing between models. It is available as a public preview through the Gemini API and the renamed Gemini Enterprise Agent Platform.

The shift is architectural more than headline-grabbing. Computer use works through a continuous loop: the developer’s app sends Gemini a screenshot and a goal, the model identifies on-screen elements and returns a structured action such as a click at specific coordinates, the app executes it, and a new screenshot goes back.
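The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Google's API: the `Action` shape and the `take_screenshot`, `ask_model` and `execute` callables are hypothetical stand-ins that the developer's app would supply, and the real structured-action schema may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """One structured step returned by the model (hypothetical shape)."""
    kind: str                    # e.g. "click", "type", "scroll", "done"
    x: Optional[int] = None      # screen coordinates for pointer actions
    y: Optional[int] = None
    text: Optional[str] = None   # text payload for typing actions


def run_agent(
    goal: str,
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[str, bytes], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> list[Action]:
    """Repeat screenshot -> model -> action -> execute until the model says done."""
    history: list[Action] = []
    for _ in range(max_steps):           # cap steps so a confused agent cannot loop forever
        screenshot = take_screenshot()   # capture the current screen state
        action = ask_model(goal, screenshot)
        if action.kind == "done":        # assumed sentinel meaning the goal is reached
            break
        execute(action)                  # the app, not the model, performs the action
        history.append(action)
    return history
```

The step cap is a design choice in this sketch rather than something Google describes: it bounds cost and limits the damage if the model never signals completion.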

Folding that into the main model collapses what was a two-model workflow into a single model. Each action now also carries an “intent” field explaining the model’s reasoning, which can serve as an audit log. Google pitches the tool for long-horizon enterprise automation such as continuous software testing, repetitive form filling and knowledge work across professional applications.
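To show how the “intent” field could feed an audit trail, here is a small sketch that writes each action as one JSON line. It assumes the action arrives as a dictionary with `kind` and `intent` keys; those key names, the `audit_record` helper and the record layout are illustrative assumptions, not a documented schema.

```python
import json
from datetime import datetime, timezone


def audit_record(step: int, action: dict) -> str:
    """Serialize one agent action, including its stated intent, as a JSON line."""
    record = {
        "step": step,
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "kind": action.get("kind"),            # assumed key for the action type
        "intent": action.get("intent", ""),    # assumed key; empty if the model sent none
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Example action with a made-up intent string.
    print(audit_record(1, {"kind": "click", "intent": "Open the login form"}))
```

Appending these lines to a write-once log would let a reviewer replay what the agent did and why it said it was doing it, though the stated intent is the model's own account and not independent evidence.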

Performance claims rest on a single, self-reported figure. Google says Gemini 3.5 Flash scores 78.4% on the OSWorld-Verified benchmark, up sharply from 65.1% for the prior Flash generation and effectively tied with OpenAI’s GPT-5.5 at 78.7%.

Every score on that leaderboard is reported by the model maker without independent verification, so the numbers are best read as directional. Google has not published benchmarks comparing the built-in tool to the old standalone model, nor named customers using it. Flash is one of Google’s cheaper models, and the company frames its main edge as doing this work at roughly a third of GPT-5.5’s per-token cost.

Why Safety Is Central

Computer use agents widen the attack surface, since a model acting on live screens can be hijacked by malicious instructions hidden in the content it reads, known as prompt injection. Google says it applied targeted adversarial training to make Flash more resistant, and is releasing two optional enterprise safeguards: one that requires explicit user confirmation before sensitive or irreversible actions, and one that automatically halts a task if an indirect prompt injection is detected.

The company recommends a defense-in-depth approach combining these with sandboxing, human review and strict access controls. Google’s robustness claims are described in a blog post rather than backed by published red-team results, and the company’s own caution signals how risky the jump from demo to production remains.
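The two safeguards described above amount to a gate that sits between the model's proposed action and the code that executes it. The sketch below shows that idea on the application side. It is not Google's implementation: the `SENSITIVE_KINDS` set, the `injection_flagged` signal and the `confirm` callback are all hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    EXECUTE = "execute"   # safe to run the action
    SKIP = "skip"         # user declined a sensitive action
    HALT = "halt"         # stop the whole task


# Hypothetical list of action types treated as sensitive or irreversible.
SENSITIVE_KINDS = {"submit_payment", "delete", "send_email"}


def gate(
    action: dict,
    injection_flagged: bool,
    confirm: Callable[[dict], bool],
) -> Decision:
    """Decide what to do with a proposed action before the app executes it."""
    if injection_flagged:
        # A suspected indirect prompt injection ends the task outright.
        return Decision.HALT
    if action.get("kind") in SENSITIVE_KINDS and not confirm(action):
        # Sensitive actions run only with explicit user approval.
        return Decision.SKIP
    return Decision.EXECUTE
```

Checking the injection flag first means a hijacked agent cannot reach the confirmation prompt at all, which avoids asking a user to approve an action that an attacker may have planted.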

A Three-Way Race

The move sharpens competition over agents that act rather than just chat. Anthropic introduced computer use for Claude in late 2024, and OpenAI has its own agent offerings, so the three leading labs are now competing less on whether a model can click a button than on which can do it safely and cheaply inside regulated environments.

Google’s bet is integration and price: a single, fast, inexpensive model that already grounds in Search and Maps and can now drive the screen too. For buyers, the open questions are reliability over long tasks, how often safety prompts interrupt automation, and whether self-reported benchmarks hold up once independent testing and real workflows arrive.
