Agent Daily News

Inside OpenHands: How an AI Coding Agent Executes Code


A teardown of how OpenHands, the open-source AI coding agent, plans, edits files, and runs code in a sandbox: the event-stream and action-observation loop.

TL;DR: OpenHands (formerly OpenDevin) runs an AI coding agent in a loop. The model emits an Action (run a command, edit a file, browse), a sandboxed runtime executes it, and the result comes back as an Observation. That action-observation cycle, logged to an event stream, is the core of the architecture. The clever part isn’t the model; it’s the runtime isolation that lets a model safely run arbitrary code it just wrote.

What OpenHands Actually Is

OpenHands is an open-source platform for AI software-development agents. You give it a task (“fix the failing test in auth.py”), and it does what a developer does: reads files, writes code, runs the test, reads the error, tries again. It has passed 60K GitHub stars, helped along by leading the open-source pack on SWE-bench Verified.

Most coverage stops at “it codes for you.” That’s not interesting. What’s interesting is the loop underneath, because that loop is the template every serious coding agent in 2026 converges on. Understanding it tells you how to build one, not just how to use one.

The Core Loop: Action and Observation

Strip away the UI and OpenHands is one loop:

flowchart LR
    A[Agent: LLM] -->|emits Action| B[Runtime]
    B -->|executes in sandbox| C[Result]
    C -->|wraps as Observation| D[Event Stream]
    D -->|appended to history| A

Each turn:

  1. The Agent (an LLM) looks at the event-stream history and emits an Action.
  2. The Runtime executes that action in an isolated sandbox.
  3. The result becomes an Observation.
  4. The observation is appended to the event stream, and the loop repeats until the agent emits a finish action.
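
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not OpenHands code: `Action`, `Observation`, `run_loop`, the `policy` callable standing in for the LLM, and the `execute` callable standing in for the runtime are all hypothetical names chosen for this example.

```python
# Hypothetical names for illustration; not the OpenHands API.
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Action:
    kind: str          # "run" or "finish" in this sketch
    payload: str = ""


@dataclass
class Observation:
    content: str


Event = Union[Action, Observation]


def run_loop(
    policy: Callable[[List[Event]], Action],    # stands in for the LLM
    execute: Callable[[Action], Observation],   # stands in for the sandboxed runtime
    max_steps: int = 20,
) -> List[Event]:
    """Run the action-observation loop until the policy emits a finish action."""
    stream: List[Event] = []
    for _ in range(max_steps):
        action = policy(stream)          # 1. agent reads the history, emits an action
        stream.append(action)
        if action.kind == "finish":      # a finish action terminates the loop
            break
        observation = execute(action)    # 2-3. runtime executes, result is wrapped
        stream.append(observation)       # 4. observation joins the history
    return stream


def scripted_policy(stream: List[Event]) -> Action:
    # Toy policy: run one command, then finish.
    return Action("finish") if stream else Action("run", "echo hi")


def fake_execute(action: Action) -> Observation:
    # Toy runtime: pretend the command ran.
    return Observation(f"ran: {action.payload}")


history = run_loop(scripted_policy, fake_execute)
```

The `max_steps` cap is a design choice of this sketch: without it, a policy that never emits a finish action would loop forever.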

Actions are a closed vocabulary. The main ones:

| Action | What it does | Observation returned |
| --- | --- | --- |
| CmdRunAction | Run a shell command | stdout, stderr, exit code |
| FileEditAction | Edit a file (line-range or whole) | success or diff |
| FileReadAction | Read file contents | the file text |
| IPythonRunCellAction | Run Python in a Jupyter kernel | cell output |
| BrowseURLAction | Fetch/interact with a web page | page content |
| AgentFinishAction | Declare the task done | terminates the loop |

This is the key design decision: the agent doesn’t have arbitrary capabilities. It has a fixed set of actions, each with a typed observation coming back. That constraint is what makes the agent debuggable. If something goes wrong, you replay the event stream and see exactly which action produced which observation.
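
One way to picture a closed vocabulary with typed observations is sketched below. The names are simplified stand-ins invented for this example (`RunCommand`, `ReadFile`, `Finish`, `CommandResult`, `FileContent`, and `observation_type` are not the OpenHands classes), and the sketch covers only three actions.

```python
# Hypothetical, simplified types; the real OpenHands classes differ.
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class RunCommand:
    command: str


@dataclass(frozen=True)
class ReadFile:
    path: str


@dataclass(frozen=True)
class Finish:
    message: str = ""


# The closed vocabulary: anything outside this union is rejected.
Action = Union[RunCommand, ReadFile, Finish]


@dataclass(frozen=True)
class CommandResult:
    stdout: str
    stderr: str
    exit_code: int


@dataclass(frozen=True)
class FileContent:
    path: str
    text: str


def observation_type(action: object) -> type:
    """Map each allowed action to the observation type it must return."""
    if isinstance(action, RunCommand):
        return CommandResult
    if isinstance(action, ReadFile):
        return FileContent
    if isinstance(action, Finish):
        return type(None)   # finish ends the loop, so nothing comes back
    raise TypeError(f"not in the action vocabulary: {type(action).__name__}")
```

Because a raw string such as `"rm -rf /"` is not one of the three action types, `observation_type` raises `TypeError` instead of passing it along, which is the constraint the paragraph above describes.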

The Event Stream Is the Memory

There’s no separate “memory module.” The event stream — the ordered log of every action and observation — is the agent’s working memory. On each turn the agent’s context is rebuilt from this stream.
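
As a rough sketch of "the log is the memory", the class below keeps an append-only list and rebuilds a chat-style context from it on demand. `EventStream`, its method names, and the mapping of agent events to the `assistant` role and runtime events to the `user` role are assumptions made for this example, not the OpenHands implementation.

```python
# Hypothetical sketch; class, fields, and role mapping are assumptions.
import json
from typing import Dict, List


class EventStream:
    """Append-only log of actions and observations."""

    def __init__(self) -> None:
        self._events: List[Dict[str, str]] = []

    def append(self, source: str, content: str) -> None:
        # source is "agent" for actions and "runtime" for observations
        event = {"id": str(len(self._events)), "source": source, "content": content}
        self._events.append(event)

    def to_messages(self) -> List[Dict[str, str]]:
        """Rebuild the model context from the full log, in order."""
        role = {"agent": "assistant", "runtime": "user"}
        return [{"role": role[e["source"]], "content": e["content"]} for e in self._events]

    def dump(self) -> str:
        """Serialize as JSON lines so a session can be inspected or replayed later."""
        return "\n".join(json.dumps(e) for e in self._events)


stream = EventStream()
stream.append("agent", "run: pytest")
stream.append("runtime", "exit code 1: 2 failed")
messages = stream.to_messages()
```

Nothing is stored apart from the log itself: `to_messages` recomputes the context from scratch each time it is called.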

This is elegant, but it has a real cost. Long tasks produce long event streams, and the stream is replayed into the context window every turn. A 40-step debugging session can blow past 100K tokens. OpenHands handles this with condensation: older events get summarized so the context stays bounded. It’s the same context-window pressure that drives agent memory architectures everywhere, applied here to a coding loop.
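
The idea behind condensation can be sketched as "keep the most recent events verbatim and fold everything older into one summary". The function below is an illustration under that assumption; `condense`, `keep_recent`, and `count_summary` are hypothetical names, and a real condenser would call a model to write the summary rather than count events.

```python
# Hypothetical sketch of condensation; not the OpenHands condenser.
from typing import Callable, List


def condense(
    events: List[str],
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace all but the last `keep_recent` events with one summary event."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(events) <= keep_recent:
        return list(events)              # short enough, nothing to condense
    cut = len(events) - keep_recent
    old, recent = events[:cut], events[cut:]
    return [summarize(old)] + recent


def count_summary(old: List[str]) -> str:
    # Placeholder for a model-written summary.
    return f"[summary of {len(old)} earlier events]"


events = [f"event {i}" for i in range(10)]
condensed = condense(events, keep_recent=3, summarize=count_summary)
```

The trade-off is that whatever the summary leaves out is gone from the agent's working context, so the choice of `keep_recent` and the quality of the summary both matter.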

The Runtime: Where the Real Engineering Is

Here’s the insight most people miss. The LLM picking actions is the easy part — any frontier model can do it. The hard, valuable part is the runtime: the sandboxed environment where actions execute.

Think about what you’re actually doing. A language model writes code and then runs it on a machine. If that machine is your laptop or your production server, one bad action — rm -rf, a fork bomb, an exfiltration script triggered by a poisoned dependency — and you’re done. The model has no concept of consequences; it pattern-matches its way to actions.

By default, OpenHands runs every action inside a Docker container (or a remote runtime). The agent gets a full Linux environment: a shell, a filesystem, a Python kernel, network access. But it’s a throwaway environment. Worst case, the agent trashes a container that you delete and recreate.

This is non-negotiable for any agent that executes generated code. We made the full argument in why autonomous agents need secure sandboxes, and OpenHands is the reference implementation of that principle: the sandbox isn’t a feature bolted on, it’s the foundation the whole loop sits on.

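# Conceptual shape of the runtime contract (Action and Observation are placeholder types)
from abc import ABC, abstractmethod

class Action: ...
class Observation: ...

class Runtime(ABC):
    @abstractmethod
    def execute(self, action: Action) -> Observation:
        """Run an action inside the isolated sandbox, return what happened.
        The agent never touches the host. Every effect is contained here."""
        ...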
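
To make the isolation concrete, here is a minimal sketch of how a runtime might wrap one shell command in a throwaway container. It only builds the `docker run` argument list and does not reflect how OpenHands itself launches its sandbox. The function name `sandbox_argv`, the `python:3.12-slim` image, the `/workspace` mount point, and the resource limits are all assumptions picked for illustration.

```python
# Hypothetical helper; image, limits, and mount point are illustrative assumptions.
from typing import List


def sandbox_argv(
    command: str,
    workspace: str,
    image: str = "python:3.12-slim",
    allow_network: bool = False,
) -> List[str]:
    """Build a `docker run` argv that executes one shell command in a throwaway container."""
    argv = [
        "docker", "run",
        "--rm",                            # delete the container when the command exits
        "--memory", "512m",                # cap memory use
        "--pids-limit", "128",             # blunt a fork bomb
        "-v", f"{workspace}:/workspace",   # expose only the task workspace
        "-w", "/workspace",                # run from inside that workspace
    ]
    if not allow_network:
        argv += ["--network", "none"]      # no network unless the task needs it
    # The command is a single argv element for `sh -c`, so the host shell never parses it.
    return argv + [image, "sh", "-c", command]


argv = sandbox_argv("pytest -q", workspace="/tmp/task-1")
```

Passing `argv` to `subprocess.run(argv, capture_output=True, text=True)` would then yield the stdout, stderr, and exit code that an observation needs, with every side effect confined to the container and the mounted workspace.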

Why the Action-Observation Format Beats Free-Form

You might ask: why not just let the model write a script and run the whole thing? Because the action-observation cycle gives the agent feedback at every step.

When the agent runs pytest and sees ModuleNotFoundError: No module named 'requests', that observation comes back before the next action. The agent reads it and emits pip install requests as the next action. A free-form “write a script and run it” approach would fail the whole script and force a restart from scratch.
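
The pytest example can be mimicked with a toy stand-in for the model: a function that looks only at the last observation and picks the next command. `next_command` is a hypothetical helper written for this illustration; a real agent uses an LLM here, and this sketch assumes the module name and the pip package name are the same, which is not always true.

```python
# Toy stand-in for the model's decision; hypothetical, for illustration only.
import re
from typing import Optional


def next_command(last_output: str) -> Optional[str]:
    """Pick the next shell command from the most recent observation."""
    missing = re.search(r"No module named '([\w.]+)'", last_output)
    if missing:
        # Assumes the pip package shares the top-level module's name.
        package = missing.group(1).split(".")[0]
        return f"pip install {package}"
    if "passed" in last_output and "failed" not in last_output:
        return None          # tests are green, nothing left to run
    return "pytest -q"       # otherwise run the tests again
```

Each call sees one observation and produces at most one command, so an error costs a single step rather than a full restart.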

This step-by-step feedback is why coding agents got dramatically better in 2025-2026. It’s the same reason a human developer doesn’t write 200 lines then run once — you run incrementally and react to errors. The architecture encodes that workflow.

What This Means If You’re Building One

You don’t need OpenHands to apply its lessons. The transferable architecture:

  1. Define a closed action vocabulary. Don’t give the agent “do anything.” Give it run_command, edit_file, read_file, finish. Constrained actions are debuggable actions.
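  2. Make every action return a typed observation. The observation is how the agent self-corrects. An observation that carries the full error text (stderr, exit code, traceback) gives the agent far more to work with than a bare pass/fail flag.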
  3. Run actions in isolation, always. If your agent can execute code, that code runs in a container or microVM, never on the host. No exceptions.
  4. Treat the event log as memory. Append everything. Summarize when it grows. Replay it to rebuild context.

FAQ

Is OpenHands the same as Devin?

No. Devin is Cognition’s closed-source commercial agent. OpenHands (originally OpenDevin) is the open-source project that started as a community effort to build something similar. They share the goal — autonomous software engineering — but OpenHands is MIT-licensed and self-hostable.

Can OpenHands run without Docker?

It can use a local runtime, but you really shouldn’t run it unsandboxed. The whole safety model depends on the agent executing code in an isolated environment. Running actions directly on your host removes the only thing protecting you from a bad action.

What models work with OpenHands?

It’s model-agnostic via LiteLLM, so any provider LiteLLM supports works, including Claude, GPT-4o, Gemini, DeepSeek, and open-source models. Coding performance varies a lot by model; the strongest SWE-bench scores come from frontier models. You can route through a single gateway like SandBase to switch models without changing code.

How does the agent know when it’s done?

It emits an AgentFinishAction. The agent itself decides the task is complete based on the event stream — for example, after the tests pass. This is also a failure mode: agents sometimes declare victory prematurely, which is why a verification step (run the tests, check the output) before finish matters.
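
A verification gate of that kind could look like the sketch below: the harness accepts a finish only if the most recent test run in the history exited cleanly. `CommandRecord` and `finish_allowed` are hypothetical names, and treating any command that starts with `pytest` as "the test run" is an assumption of this example.

```python
# Hypothetical finish gate; names and the pytest convention are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class CommandRecord:
    command: str
    exit_code: int


def finish_allowed(history: List[CommandRecord], test_command: str = "pytest") -> bool:
    """Accept a finish only if the most recent test run exited with code 0."""
    for record in reversed(history):
        if record.command.startswith(test_command):
            return record.exit_code == 0
    return False   # no test run recorded, so the finish is not accepted


history = [
    CommandRecord("pytest -q", 1),
    CommandRecord("pip install requests", 0),
]
premature = finish_allowed(history)      # last test run failed
history.append(CommandRecord("pytest -q", 0))
verified = finish_allowed(history)       # last test run passed
```

A stricter variant would also reject the finish if any file edit happened after the last passing run, since the tests would then no longer describe the current code.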

How is this different from a multi-agent framework like CrewAI?

OpenHands is a single agent in a tight execute loop, optimized for coding. Frameworks like CrewAI orchestrate multiple role-based agents. Different problems — see AutoGen vs CrewAI for the multi-agent side. You could even run OpenHands-style agents as nodes in a larger crew.

Key Takeaways

  • OpenHands is one loop: the agent emits an Action, a sandboxed runtime executes it, the result returns as an Observation, and the event stream records everything.
  • The event stream is the memory. There’s no separate memory module, just an append-only log replayed into context each turn (with condensation when it grows).
  • The runtime, not the model, is where the engineering value lives. Isolation is what makes it safe for a model to run code it just wrote.
  • To build your own: closed action vocabulary, typed observations, mandatory sandboxing, event log as memory. That’s the template.
