Why the Simplest Desktop Agent Abstraction Wins

Jun 20, 2025

This is the first post in a series about the design and implementation of Bytebot. Give us a star on our open-source repo.

We’re still in the early innings of AI agents. There are hundreds of companies building wrappers around LLMs, trying to make them more useful: more tool-aware, more stateful, more capable of completing tasks across applications. But most of them are converging on the same approach: building agents that work by connecting APIs and tools in structured ways.

Bytebot was born out of a fundamentally different belief: that the simplest and most universal abstraction for agent control already exists, and we’ve been using it for decades.

The Agent as Remote Worker

Here’s the core idea: give an LLM access to a keyboard, a mouse, and a screen. Nothing more.

That’s it. That’s the interface. That’s what a human remote worker uses. And it’s the only interface you need to approximate the vast majority of digital work.

Why does this work? Because nearly all software, workflows, and enterprise tooling have been designed (whether explicitly or implicitly) for a human sitting at a computer. If we can simulate the inputs of a human worker and read the same outputs (screen pixels), we can plug into the same workflows. No custom integrations required.
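To make that concrete, here is a minimal sketch of the loop such an agent runs. This is illustrative, not Bytebot’s actual code; takeScreenshot, askModel, and performAction are hypothetical stand-ins for screen capture, a model call, and input-event synthesis.

```typescript
// The entire interface: pixels in, input events out.
type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "key"; combo: string } // e.g. "ctrl+s"
  | { kind: "scroll"; dy: number }
  | { kind: "done" };

// Hypothetical stubs for screen capture, the model call, and event synthesis.
declare function takeScreenshot(): Promise<Uint8Array>;
declare function askModel(task: string, screen: Uint8Array): Promise<Action>;
declare function performAction(action: Action): Promise<void>;

async function runTask(task: string): Promise<void> {
  for (let step = 0; step < 50; step++) {        // hard cap on steps
    const screen = await takeScreenshot();        // raw pixels, nothing else
    const action = await askModel(task, screen);  // model picks the next input
    if (action.kind === "done") return;
    await performAction(action);
  }
  throw new Error(`Task did not finish within the step budget: ${task}`);
}
```

Everything else in the system is scaffolding around this loop.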

This approach isn’t just simpler - it’s more robust, more generalizable, and more future-proof.

We Tried the Other Way First

Before the current version of Bytebot, we built it as a browser agent.

It started innocently enough: we added hooks for prompting into Playwright scripts, letting LLMs handle finding selectors and XPaths:

(The first version of Bytebot, a framework for browser automation)
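In simplified form, it looked something like the sketch below. This is a reconstruction rather than the original code; askModelForSelector is a hypothetical helper that prompts an LLM with the page’s HTML.

```typescript
import { chromium } from "playwright";

// Hypothetical helper: send the page's HTML plus a natural-language
// instruction to an LLM and get back a CSS selector.
declare function askModelForSelector(
  html: string,
  instruction: string
): Promise<string>;

async function clickByInstruction(url: string, instruction: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // The model picks the selector instead of a hardcoded one...
  const selector = await askModelForSelector(await page.content(), instruction);

  // ...but the click still goes through the DOM, with all its fragility.
  await page.locator(selector).click();
  await browser.close();
}
```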

Then we started building agents that could remote-control browsers using models. Then we built a full-on RPA-style orchestration layer for browser-based tasks.

And we hit every wall imaginable:

  • No reliable drag-and-drop support

  • File-download workarounds that broke every other week

  • Customers who needed 2FA and password managers like 1Password or Okta

  • People wanting to fill out a PDF offline, then upload it back

  • Apps with no APIs that were nonetheless crucial to customers’ workflows

It turns out, browsers are only a fraction of what people actually use. The DOM is not the whole world. So we had a choice: build 1,000 integrations… or zoom out.

The Bitter Lesson

In 2019, Rich Sutton published what’s now known as “The Bitter Lesson”. The gist is: the biggest gains in AI haven’t come from building complex logic or specialized systems. They’ve come from general methods that scale with compute.

We saw this lesson firsthand.

Every time we built a clever workaround (a DOM compressor, a canvas interpreter, a proprietary multi-step planner), a new model would come out and make it obsolete. We’d have to rip it all out and start over. Again.

So instead of fighting the models, we decided to get out of their way.

We stopped designing our agent abstractions to match the limitations of current LLMs. We started designing them around the realities of human-computer interaction: assuming models will get better at dealing with screens, input events, and sequential planning. Which they are, rapidly.

Horseless Carriages vs. Autonomous Agents

There’s a natural criticism of our approach: “Aren’t you just recreating a horseless carriage - a legacy interface for a new kind of intelligence?”

Yes. Intentionally.

We’re not pretending every task is best solved with a desktop agent. There are agent architectures that thrive on abstraction:

  • Research agents benefit from tool-rich environments with structured APIs: search tools, calculators, translators, embeddings, and more.

  • Code agents can digest an entire codebase, refactor modules, or write libraries using recursive self-feedback and long context windows.

  • Multi-agent planners can orchestrate workflows where no human-like interaction is needed at all.

But that’s not the whole world.

There’s a massive category of work that lives in the no-man’s land between APIs and deep internal logic. It’s the gnarly, hard-to-automate workflows that involve:

  • Jumping between apps

  • Copy-pasting between SaaS tools

  • Downloading PDFs, renaming them, and uploading them to different portals

  • Typing into legacy desktop software that has no public API

  • Handling authentication, file dialogs, drag-and-drop interactions

This is the unglamorous work: the stuff that still eats up countless hours across legal, ops, HR, finance, logistics, compliance, and enterprise support. And it’s not going away. It’s deeply embedded in real companies doing real work.

That’s the gap we’re trying to bridge. Not by replacing every workflow with a smarter agent, but by giving models a body: a way to operate inside existing systems without asking the systems to change.

Why This Abstraction Wins

So what do you get when you build from first principles: keyboard, mouse, screen?

  • Universality: It works across every app, every OS, every website.

  • Fidelity: You can click, scroll, drag, type, trace. Anything a human can do, the agent can do.

  • Composability: Actions can be learned, chained, recorded, reasoned over.

  • Observability: Screenshots are the source of truth, making agent decisions traceable.

  • Extensibility: Add audio, notifications, or sensors later. The core stays stable.

It’s not minimalism for its own sake. It’s choosing primitives that are stable across all computing environments and robust to change.
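The observability point deserves one concrete illustration. Because every decision is grounded in a screenshot, an agent run can be stored as a replayable trace. Here is one hypothetical shape for such a trace (not Bytebot’s actual schema):

```typescript
// Hypothetical trace format: each step pairs what the agent saw with what
// it did, so a run can be audited frame by frame.
interface TraceStep {
  at: Date;
  screenshot: Uint8Array; // the pixels the model actually acted on
  action: string;         // e.g. 'click 412,300' or 'type "Q2 report"'
  reasoning?: string;     // optional model rationale, if available
}

// Render a run as a human-readable audit log.
function describeRun(trace: TraceStep[]): string {
  return trace
    .map((s, i) => `${i + 1}. [${s.at.toISOString()}] ${s.action}`)
    .join("\n");
}
```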

What This Enables for Companies

Most companies don’t want to build their own agents. They want something that works out of the box and doesn’t require reorganizing their workflows.

That’s what this architecture enables:

  • No APIs to maintain

  • No special integrations to build

  • No model-specific logic to write

Just tasks, tools, and screens.

Give the agent access to the same computing environment your remote team uses. Assign tasks. Let it click, type, scroll, and think. It works with what you already have.

And because the intelligence is model-agnostic and externalized, it gets better automatically as models get better. The scaffolding doesn’t change. The agent just gets smarter.
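Concretely, “model-agnostic” can be as simple as one narrow contract between the scaffolding and whatever model is current. A sketch under assumed names (this is not Bytebot’s actual API):

```typescript
// The scaffolding talks to any model through this one interface, so
// upgrading the model touches nothing else.
interface ComputerUseModel {
  // Given the task and the current screen, propose the next input event.
  nextAction(task: string, screenshot: Uint8Array): Promise<string>;
}

// Swapping providers means swapping this one implementation; screen capture,
// input synthesis, and the task queue never change.
class StubModel implements ComputerUseModel {
  async nextAction(_task: string, _screenshot: Uint8Array): Promise<string> {
    return "click 100,200"; // a real implementation would call a provider API
  }
}

const model: ComputerUseModel = new StubModel();
```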

Building for the Plateau, Not the Peak

Everyone’s building for GPT-5. We’re building for what comes after.

The goal of Bytebot isn’t to chase state-of-the-art LLM performance with narrow integrations. It’s to build an environment where agents can operate reliably no matter how the intelligence improves: an interface that’s durable, general, and designed for the long game.

Next up: how we built a containerized Linux OS from scratch to support that vision.
