The Bytebot Core – From Linux Container to Agent Control Surface
Jul 12, 2025
This is the second post in a series about the design and implementation of Bytebot. Give us a star on our open source repo.
In Part One, we talked about the design philosophy behind Bytebot: that the most powerful abstraction for digital agents is the same one we give human workers—a keyboard, a mouse, and a screen.
This post gets into the next layer down: how we actually built the environment and control surface that makes that abstraction usable in production.
The Bytebot core container is a fully operational computing environment, backed by a stable automation layer (the Bytebot daemon) that exposes all the essential computer use primitives. We’ll walk through both parts here: the container OS and the daemon that drives it.
The QEMU Misadventure
When we set out to build Bytebot, the original plan was ambitious: support every OS. macOS. Windows. Linux. The idea was to build a virtual machine manager that could boot any environment and expose a universal set of computer use primitives. A single control plane, multiple backends. One abstraction to rule them all.
And we tried.
We used QEMU to run virtual machines inside Docker containers. We built image pipelines. We exposed ports. We started building a full orchestration stack.
And it was miserable.
Running QEMU inside Docker was slow, fragile, and borderline unmaintainable. Disk images bloated builds. CPU usage was brutal. Networking was brittle. And worst of all, we were spending more time debugging virtualization layers than building the product we wanted.

(The original Bytebot QEMU design)
More importantly: we realized we didn’t need all that generality.
We weren’t trying to build VMs for their own sake; we were trying to build a reliable, predictable surface for an AI agent to use a computer. And if you zoom out from “we need to support every OS” to “we need to support realistic, human-like computer use,” the requirements shift.
We didn’t need a Swiss Army knife. We needed a single, sharp blade.
The Pivot: One OS, Done Well
So we made a hard cut and dropped QEMU. We stopped trying to support multiple OSs, and built something more focused. A minimal Linux desktop environment, containerized with Docker, and designed explicitly for computer use by agents.
We started with Ubuntu, stripped it down, and layered on Xfce4. Then we added just the basics:
A browser
A password manager
An email client
A file manager
A terminal
That’s it. No start menu. No other distractions. Just the minimum environment needed to replicate most knowledge work tasks.
This wasn’t a compromise. It was a design decision.

(The Bytebot desktop)
Why Linux Only?
We chose to standardize around a single platform, Linux. Not because it’s “better,” but because it’s:
Deterministic: We control every aspect of the windowing environment (X11).
Containerizable: We can ship the whole OS in a reproducible, scriptable way.
Extensible: We can install anything—GUI apps, CLIs, agents, local LLMs.
Debuggable: We can inspect window titles, cursor positions, and screen buffers precisely.
The result is an “opinionated but flexible” desktop optimized for agent-driven interaction. It’s clean enough to be predictable for automation, but open enough that developers can test new behaviors, install dependencies, or extend the environment however they want. We see it as the starting point for a long-term rethinking of what an operating system should look like when its primary user isn’t human.
The Bytebot Daemon
Running a desktop is only half the battle. We also needed a way to control it. That’s where bytebotd comes in. It’s a headless process that exposes both REST and MCP APIs, letting language models (or other clients) issue keyboard and mouse commands in a structured way.
Every form of I/O—automation calls, live video, even event streams—travels through one port and only one port: 9990.
That means:
All keyboard, mouse, and screenshot commands go through the daemon
The VNC server is reverse-proxied through it
There’s a single ingress surface for the container—everything else can be locked down
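To make the single-ingress design concrete, here’s a minimal sketch of what a client-side request to the daemon might look like. The endpoint path (`/computer-use`) and the JSON field names are illustrative assumptions, not the documented API; only the port (9990) comes from the text above.

```python
import json

# Assumed base URL: everything -- input, screenshots, even the VNC proxy --
# goes through this one port.
DAEMON_URL = "http://localhost:9990"

def build_action_request(action: str, **params) -> dict:
    """Build a single JSON request envelope for the daemon.

    Every primitive (keyboard, mouse, screenshot) shares one envelope,
    so one port and one schema cover the whole control surface.
    The "/computer-use" path is a hypothetical placeholder.
    """
    return {
        "url": f"{DAEMON_URL}/computer-use",  # assumed endpoint path
        "body": json.dumps({"action": action, **params}),
    }

req = build_action_request("screenshot")
```

Keeping a single envelope like this means the container needs exactly one exposed port, and everything else can be firewalled off.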
The Core Action Surface – Keyboard, Mouse, and Vision
At the heart of Bytebot is a deliberate belief: every interaction a human can have with a computer can be reduced to three primitives:
Keyboard
Mouse
Vision (screenshots)
Everything we’ve built in the Bytebot Daemon revolves around those. No abstractions layered on top. No domain-specific workflows. Just raw, composable control over input and output.
Here’s how that looks under the hood:
Keyboard Actions
Keyboard actions support special keys, modifiers, and sequencing:
type_text: types raw characters (e.g., “hello world”)
type_keys: sequential symbolic keys (e.g., ["Tab", "Enter"])
press_keys: key combinations with modifiers (e.g., ["Control", "S"])
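A quick sketch of how those three keyboard primitives might look as request payloads. The field names (`text`, `keys`) are assumptions for illustration; the action names mirror the list above.

```python
def type_text(text: str) -> dict:
    # Raw character input, e.g. filling a form field.
    return {"action": "type_text", "text": text}

def type_keys(keys: list[str]) -> dict:
    # Sequential symbolic keys, pressed and released one at a time.
    return {"action": "type_keys", "keys": keys}

def press_keys(keys: list[str]) -> dict:
    # Chorded combination: all keys held together, e.g. Ctrl+S.
    return {"action": "press_keys", "keys": keys}

save = press_keys(["Control", "S"])
```

The distinction between `type_keys` (sequence) and `press_keys` (chord) is what lets a model express both tabbing through a form and hitting a shortcut.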
Mouse Actions
Every mouse action supports optional modifiers (like Shift, Control) to enable complex behaviors:
move_mouse: to any coordinate
click_mouse: left/right/middle, any count, with modifiers
press_mouse: long pressing or releasing
drag_mouse: path with held button
trace_mouse: custom curves for signatures, etc.
scroll: vertical/horizontal scrolling with directional control
These actions offer low-level precision without locking the model into any pre-defined UI schema.
Vision
screenshot: full-frame X11 capture
Screenshots are the agent’s eyes. It’s not DOM-level parsing; it’s real, human-like screen reading. This allows for reasoning over UI states, modal visibility, spatial layouts, and more.
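On the client side, a screenshot plausibly arrives as a base64-encoded PNG (an assumption for this sketch). Decoding it and checking the PNG signature bytes is a cheap sanity check before handing the image to a vision model:

```python
import base64

# The eight-byte PNG file signature, per the PNG specification.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def decode_screenshot(b64_image: str) -> bytes:
    """Decode a base64 screenshot and verify it looks like a PNG."""
    raw = base64.b64decode(b64_image)
    if not raw.startswith(PNG_MAGIC):
        raise ValueError("not a PNG screenshot")
    return raw

# Simulated response for illustration (not real image data):
fake_response = base64.b64encode(PNG_MAGIC + b"...").decode()
image = decode_screenshot(fake_response)
```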
Why These Three Primitives Are Enough
We believe in minimal primitives that compose cleanly:
Keyboard + Mouse can control any interface
Vision lets you read the result
That’s it. With these three, a model can:
Navigate a website, download a JSON file, and email it as an attachment
Open an email invitation and create an account for a password manager
Together, they give any model with tool-calling and visual reasoning the ability to act like a real user.
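To show how the primitives compose, here’s a hypothetical decomposition of one high-level step ("save the open file under a name") into the raw actions above. The action names mirror the lists in this post; the field names remain illustrative assumptions.

```python
def save_file_workflow(filename: str) -> list[dict]:
    """One human-level step expressed as a flat sequence of primitives."""
    return [
        {"action": "press_keys", "keys": ["Control", "S"]},  # open save dialog
        {"action": "type_text", "text": filename},           # name the file
        {"action": "type_keys", "keys": ["Enter"]},          # confirm
        {"action": "screenshot"},                            # read the result
    ]

steps = save_file_workflow("report.json")
```

Note the shape: act with keyboard and mouse, then close the loop with vision. That act-then-look rhythm is the whole interaction model.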
In Part Three, we’ll explore how the Bytebot Agent consumes this foundation: how tasks are structured, how models interact with the tools, and how real-world workflows emerge from simple primitives and a tight event loop.
© 2025 Tantl Labs, inc. All rights reserved.