Agent Daily News

SGLang Explained: The Low-Latency Inference Engine for Agents

How SGLang works, why RadixAttention gives agents faster prefix reuse, and when to choose it over vLLM for production inference in 2026.

TL;DR: SGLang is a high-performance inference framework from UC Berkeley/LMSYS that comes at serving from a different angle than vLLM. Instead of just paging memory, it uses a radix tree (RadixAttention) to automatically discover and reuse shared prefixes across requests. For agent workloads where the same system prompt and conversation history repeat with slight variations, this means measurably lower latency and less redundant computation. It’s also the RL rollout backbone for training multiple frontier models.

What SGLang Is

SGLang is both a serving framework and a programming language for structured LLM interactions. The name stands for “Structured Generation Language” — it was designed from the start for workloads where models don’t just generate free text but follow schemas, call tools, and produce structured output.

I came to SGLang from vLLM, expecting a sidegrade. What changed my mind was a coding-agent workload where every request shared a 3K-token system prompt and a growing conversation history. On vLLM the TTFT crept up as histories grew; SGLang held steady because it stopped recomputing the parts it had already seen. That’s the whole pitch, and for agents it’s not a small thing.

Key numbers in 2026:

| Metric | Value |
|---|---|
| GitHub stars | 30K+ |
| OpenRank (May 2026) | 455.39 |
| Active contributors | 415 |
| Hardware support | NVIDIA GB200/H100/A100, AMD MI300/MI355, Intel Xeon, TPU, Ascend NPU |
| GPU fleet powered | 400,000+ worldwide |
| License | Apache 2.0 |

That “400,000+ GPUs” number isn’t marketing — SGLang is the rollout backend for training frontier models at multiple organizations, and it powers production inference at scale.

The Core Innovation: RadixAttention

Where vLLM’s PagedAttention asks “how do I allocate memory efficiently per request?”, SGLang’s RadixAttention asks “how do I avoid recomputing KV cache for prompt segments I’ve already seen?”

The mechanism: SGLang maintains a radix tree (a compressed trie keyed on token sequences) of all previously computed KV cache segments. When a new request arrives, the engine walks the tree to find the longest matching prefix, with no manual hint from the application.

Radix Tree (KV Cache):

Root
├── "You are an AI assistant that..." (system prompt, cached)
│   ├── "User: analyze this code..." (turn 1, cached)
│   │   ├── "Assistant: The code has..." (turn 1 response)
│   │   │   └── "User: fix the bug in..." (turn 2 → NEW, compute only this)
│   │   └── "User: now write tests..." (turn 2 alt → NEW)
│   └── "User: deploy to prod..." (different conversation → NEW from here)
└── "You are a code reviewer..." (different system prompt)
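
To make the lookup concrete, here is a minimal sketch of longest-prefix matching over token IDs. It is a plain token-level trie, not SGLang's implementation: a real radix tree compresses runs of tokens into single edges and has to evict entries when memory runs out, and both are left out here. The names `PrefixCache`, `match_prefix`, and `insert` are made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps the next token ID to the child node that extends the prefix by it.
    children: dict[int, "Node"] = field(default_factory=dict)


class PrefixCache:
    """Toy token-level trie standing in for the radix tree of cached KV segments."""

    def __init__(self) -> None:
        self.root = Node()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have cached entries."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record that entries now exist for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


cache = PrefixCache()
system = [1, 2, 3, 4]         # stand-in token IDs for the system prompt
turn1 = system + [10, 11]     # first user turn of one conversation
cache.insert(turn1)

turn2 = turn1 + [20, 21, 22]  # same conversation, one more turn
other = system + [30, 31]     # different conversation, same system prompt

reused_turn2 = cache.match_prefix(turn2)
reused_other = cache.match_prefix(other)
print(reused_turn2, len(turn2) - reused_turn2)  # 6 reused, 3 left to compute
print(reused_other, len(other) - reused_other)  # 4 reused, 2 left to compute
```

The second lookup is the interesting one: the new conversation never saw `turn1`, but it still reuses the system prompt because the match stops wherever the two sequences diverge.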

For agents, this is transformative. Consider what happens in a typical coding agent session:

  1. Every turn shares the same system prompt (2-4K tokens)
  2. The conversation history grows incrementally (each turn adds ~500 tokens)
  3. The agent often retries with slightly modified prompts
  4. Tool-call results get appended to the same prefix

With block-level prefix caching (vLLM’s approach), reuse happens in whole fixed-size token blocks: a new request reuses the leading blocks whose contents exactly match a cached sequence, and a partially matching block is recomputed. With RadixAttention, any shared prefix, ending at any depth in the conversation tree, gets automatically reused. The result: up to 60% reduction in time-to-first-token (TTFT) on workloads with high prefix overlap.
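
A quick back-of-the-envelope sketch shows why the growing-history pattern matters. It counts prompt tokens that need prefill over a ten-turn session, using the rough sizes mentioned above (a 3K-token system prompt, about 500 tokens added per turn). The assumptions are loud: the comparison is against no prefix reuse at all rather than against vLLM's block-level cache, the cache never evicts, and prefill token count is only a proxy for TTFT. `prefill_tokens` is a hypothetical helper, not an SGLang API.

```python
def prefill_tokens(system_len: int, turn_len: int, turns: int, cached: bool) -> int:
    """Total prompt tokens that need prefill across `turns` requests of one session."""
    total = 0
    for t in range(1, turns + 1):
        prompt_len = system_len + t * turn_len
        if cached and t > 1:
            # Assumption: the previous request's whole prompt is still cached,
            # so only the newly appended turn has to be processed.
            total += turn_len
        else:
            total += prompt_len
    return total


no_reuse = prefill_tokens(system_len=3000, turn_len=500, turns=10, cached=False)
with_reuse = prefill_tokens(system_len=3000, turn_len=500, turns=10, cached=True)
print(no_reuse, with_reuse)  # 57500 8000
```

Under these idealized assumptions the session prefills 8,000 tokens instead of 57,500. Real savings will be smaller once eviction, batching, and decode time enter the picture.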

Why Agents Specifically Benefit

Agent workloads have a unique access pattern that plays perfectly into RadixAttention:

Multi-turn conversations with branching — An agent tries approach A, fails, backtracks, tries approach B. Both branches share the same prefix up to the decision point. RadixAttention caches the shared part once.

Repeated system prompts across sessions — If you’re running 100 agent sessions with the same instructions, the system prompt KV cache is computed once and shared across all of them. Not just at the block level — at the full prefix level.

Structured generation — Agents need JSON output for tool calls. SGLang’s compressed finite state machine (FSM) for structured output means the engine knows which tokens are valid at each position and skips impossible paths. This isn’t a post-hoc constraint — it’s integrated into the decode loop.

Speculative execution on branches — SGLang can speculatively pre-compute likely continuations. If an agent typically runs pytest after editing a file, the engine can start generating the next action while still streaming the current one.
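
The structured generation point above is easier to see with a toy. The sketch below drives decoding with a small hand-written state machine: at each state only the listed tokens are legal, and when exactly one token is legal it is emitted without asking the model at all. This is an illustration of the idea only. The real engine works on tokenizer tokens and automata compiled from a regex or schema, and `FSM`, `fake_model_scores`, and `constrained_decode` are invented names.

```python
# Transition table: state -> {allowed token: next state}. ACCEPT ends decoding.
ACCEPT = 5
FSM = {
    0: {"{": 1},
    1: {'"action"': 2},
    2: {":": 3},
    3: {'"run"': 4, '"edit"': 4},
    4: {"}": ACCEPT},
}


def fake_model_scores(prefix: list[str]) -> dict[str, float]:
    """Stand-in for a model forward pass: one score per vocabulary token."""
    return {"hello": 3.0, '"edit"': 2.0, '"run"': 1.5, "{": 0.1, "}": 0.1}


def constrained_decode() -> tuple[str, int]:
    state, out, model_calls = 0, [], 0
    while state != ACCEPT:
        allowed = FSM[state]
        if len(allowed) == 1:
            # Only one legal continuation: emit it without a forward pass.
            tok = next(iter(allowed))
        else:
            model_calls += 1
            scores = fake_model_scores(out)
            # Mask: only tokens the state machine allows can win,
            # even though "hello" has the highest raw score.
            tok = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        out.append(tok)
        state = allowed[tok]
    return "".join(out), model_calls


text, calls = constrained_decode()
print(text, calls)  # {"action":"edit"} 1
```

Five tokens come out, but the model is consulted only once, at the single state where there is a real choice.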

Architecture Overview

SGLang’s architecture is a three-layer system:

┌──────────────────────────────────────────┐
│  Frontend (Python DSL)                    │
│  gen(), select(), fork(), join()          │
├──────────────────────────────────────────┤
│  Compiler / Tracer                        │
│  Builds dataflow graph from program       │
├──────────────────────────────────────────┤
│  Runtime (SGVM)                           │
│  RadixAttention + Scheduler + Workers     │
│  Prefill-Decode Disaggregation            │
│  Multi-LoRA Batching                      │
└──────────────────────────────────────────┘

The frontend gives you Python primitives for expressing structured generation programs. The compiler traces execution to build a dataflow graph. The runtime executes it with all the serving optimizations.

But you don’t have to use the DSL frontend. SGLang also exposes a standard OpenAI-compatible API — you can use it as a drop-in replacement for vLLM if you just want the serving performance without the programming model.

Benchmarks: Where SGLang Wins (and Doesn’t)

Based on published benchmarks with Llama 3.3 70B on H100:

| Scenario | SGLang | vLLM | Winner |
|---|---|---|---|
| p95 latency (streaming, small batch) | ~48ms | ~85ms | SGLang |
| Throughput (batch 128+) | Competitive | Slightly higher | vLLM |
| TTFT with 60%+ prefix overlap | ~40% faster | Baseline | SGLang |
| FP4 on Blackwell (batch 1) | 1.32x faster | Baseline | SGLang |
| FP4 on Blackwell (batch 128) | 2.23x faster than BF16 | | SGLang |
| Cold start (model load) | Similar | Similar | Tie |
| Structured output (JSON) | Faster (FSM-native) | Grammar-guided | SGLang |

The pattern is clear: SGLang wins on latency and prefix-heavy workloads; vLLM holds the throughput crown at very large batch sizes. For interactive agent sessions (small batch, high prefix overlap, structured output), SGLang has the edge.

The RL Training Connection

Here’s something most serving-framework comparisons miss: SGLang is also a major RL rollout backend. Frameworks like AReaL, verl, Slime, and Miles use SGLang to generate rollouts during reinforcement learning training of frontier models.

Why does this matter to you as an agent builder? Because the models you’re using were likely trained with SGLang in the loop. The optimizations SGLang makes for structured generation and tool-use patterns directly influence how well models learn those behaviors during RL. It’s a feedback loop: better serving of structured agent interactions → better training data → better agent models.

Running SGLang

# Install
pip install "sglang[all]"

# Launch server (OpenAI-compatible)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tp 2 \
  --port 8000

# Use it like any OpenAI endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100
  }'
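
The same call from Python, as a minimal standard-library sketch. It assumes the server from the launch command above is listening on localhost:8000 and that the response follows the usual OpenAI chat-completions shape. `build_chat_request` and `chat` are hypothetical helper names, not part of SGLang.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, messages: list[dict],
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    body = json.dumps(
        {"model": model, "messages": messages, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's text (needs a live server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


req = build_chat_request(
    "http://localhost:8000",
    "meta-llama/Llama-3.3-70B-Instruct",
    [{"role": "user", "content": "Hello"}],
)
# print(chat(req))  # uncomment once the server is running
```

Nothing here is SGLang-specific, which is the point: any client that already speaks the OpenAI API should only need a different base URL.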

Or with the DSL for structured agent interactions:

import sglang as sgl

@sgl.function
def agent_step(s, history, tools):
    s += sgl.system("You are a coding agent. Respond with a JSON action.")
    s += history
    s += sgl.gen("action", max_tokens=500, regex=r'\{"action": "\w+", "args": \{.*\}\}')

# The regex constraint is enforced at decode time, so the output always has this outer shape.
# Note that the loose .* inside "args" can still admit text that is not valid JSON.
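
It helps to check what the pattern in the DSL example actually admits before relying on it. The sketch below tests the same regex with Python's `re` module, on the assumption that the engine's regex dialect treats these simple constructs (`\w`, `.*`, escaped braces) the same way. The sample strings are made up.

```python
import json
import re

# Same pattern as in the DSL example above.
ACTION_PATTERN = re.compile(r'\{"action": "\w+", "args": \{.*\}\}')


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


good = '{"action": "run_tests", "args": {"path": "tests/"}}'
bad = '{"action": "run tests", "args": {}}'   # the space is not matched by \w
loose = '{"action": "x", "args": {oops}}'     # fits the regex, is not JSON

print(bool(ACTION_PATTERN.fullmatch(good)), is_valid_json(good))    # True True
print(bool(ACTION_PATTERN.fullmatch(bad)))                          # False
print(bool(ACTION_PATTERN.fullmatch(loose)), is_valid_json(loose))  # True False
```

The third case is the caveat: the constraint fixes the outer shape, but `.*` leaves the contents of `args` unconstrained. If the arguments must parse, a tighter pattern or a schema-based constraint is the safer choice.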

Prefill-Decode Disaggregation

One architectural decision that’s increasingly important for agents: SGLang supports separating prefill (processing the input prompt) from decode (generating output tokens) onto different GPU pools.

Why this matters: Agent requests often have very long prompts (conversation history + tool results) but short outputs (the next action). Prefill is compute-bound; decode is memory-bound. Running them on the same GPU means they fight for resources. Disaggregation lets you scale them independently.
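
A rough roofline-style estimate makes the compute-bound versus memory-bound split concrete. The sketch uses the common approximation of about 2 FLOPs per parameter per token and counts only weight reads, ignoring attention and KV cache traffic. The hardware figures are approximate H100-class numbers that I am assuming for illustration, not measurements.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs per byte of weights read, at ~2 FLOPs per parameter per token.

    The parameter count cancels out: (2 * P * tokens) / (P * bytes_per_param).
    """
    return 2.0 * tokens_per_pass / bytes_per_param


# Assumed peak dense BF16 throughput (FLOP/s) over assumed memory bandwidth (bytes/s).
RIDGE = 989e12 / 3.35e12

prefill = arithmetic_intensity(tokens_per_pass=4096, bytes_per_param=2.0)
decode = arithmetic_intensity(tokens_per_pass=1, bytes_per_param=2.0)

for name, value in (("prefill", prefill), ("decode", decode)):
    bound = "compute-bound" if value > RIDGE else "memory-bound"
    print(f"{name}: {value:.0f} FLOP/byte vs ridge {RIDGE:.0f} -> {bound}")
```

A 4096-token prefill lands far above the ridge point, while single-token decode lands far below it. Batching decode requests raises its intensity, which is part of why the two phases want different scaling and scheduling.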

The SGLang roadmap (issue #21846: “Distributed KVCache System For Agentic Workload”) makes this explicit — agentic workloads are driving KV cache volumes that require new distributed storage and transfer architectures beyond simple PD disaggregation.

The Cascading Failure Problem (A Real Incident)

Here’s something the marketing pages won’t tell you, documented in SGLang issue #20252. A team ran qwen3-32b-fp8 on an H20 cluster: 90 prefill nodes + 30 decode nodes, PD-disaggregated. Under high QPS, some prefill nodes restarted or migrated. The decode nodes kept retrying the failed connections, health checks started failing, the router removed those workers from rotation, and traffic concentrated on the remaining nodes — which then also got overwhelmed. Result: cascading failure and a flood of 503s across the whole cluster.

The lesson generalizes to any disaggregated serving setup, vLLM included: the engine running is the easy part; the failure-propagation behavior of the cluster is the hard part. When you disaggregate prefill and decode, you’ve introduced a distributed system with all its failure modes. A single node hiccup can ripple through routing and health checks into a global outage if you don’t have:

  • Circuit breakers between prefill and decode pools (stop retrying a dead node fast)
  • Health checks tuned so transient restarts don’t trigger removal
  • Load shedding (reject early rather than queue infinitely under overload)
  • Headroom so losing one node doesn’t tip the rest over
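
The first item on that list can be sketched in a few lines. This is a generic circuit breaker, not anything SGLang ships: the class, its thresholds, and the idea of keeping one breaker per prefill node on the decode side are assumptions for illustration. The clock is injectable so the behavior can be shown without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a failing peer for a cooldown period instead of retrying it."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call to the peer should be attempted now."""
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cooldown has passed, then let probes through.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open and restart the cooldown


now = [0.0]  # fake clock so the example is deterministic
breaker = CircuitBreaker(failure_threshold=3, cooldown_s=5.0, clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()   # three failed connections to one prefill node
blocked = breaker.allow()      # False: stop retrying the dead node
now[0] = 6.0
probe = breaker.allow()        # True: cooldown elapsed, try one probe
breaker.record_success()
closed = breaker.allow()       # True: node is back in rotation
print(blocked, probe, closed)
```

The design choice that matters for the incident above is the fast "no": a decode node that stops retrying a restarting prefill node frees its own capacity, so its health checks are less likely to fail and pull it out of rotation too.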

If you’re running SGLang single-node, none of this applies — you get the RadixAttention benefits with none of the distributed complexity. The failure modes only show up when you scale to multi-node disaggregation. Don’t reach for PD disaggregation until a single node genuinely can’t handle your load.

Tuning SGLang: The Flags That Matter

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tp 2 \
  --port 8000 \
  --mem-fraction-static 0.88 \
  --chunked-prefill-size 8192 \
  --max-running-requests 128 \
  --schedule-conservativeness 0.3

  • --mem-fraction-static 0.88: Fraction of GPU memory set aside for static allocations, meaning the model weights plus the KV cache pool that the radix tree manages. Higher = more room for cached prefixes, but set it too high and too little is left for activations and other runtime buffers, so you OOM under load. 0.88 is a safe ceiling on 80GB cards.
  • --chunked-prefill-size 8192: Caps how many prefill tokens are processed per scheduler step. Lower values protect decode latency when long prompts arrive; higher values improve prefill throughput. 8192 balances both for agent traffic.
  • --max-running-requests 128: Upper bound on how many requests run concurrently. It limits how far the scheduler will grow the batch, which keeps per-request decode latency predictable.
  • --schedule-conservativeness 0.3: Lower values make the scheduler more aggressive about admitting requests (higher throughput, more preemption risk). The default 1.0 is conservative; 0.3 squeezes more throughput when you have cache headroom.

The radix tree is automatic — you don’t configure prefix matching. But its hit rate depends on traffic patterns. Watch the cache hit rate in the server logs: for agent fleets with shared system prompts it should be 60%+. If it’s low, your requests aren’t sharing prefixes the way you assumed, and you’re paying for a feature you’re not using.
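
A common cause of a low hit rate is volatile content placed ahead of the stable instructions. The sketch below measures the shared prefix between two prompts with the volatile field first and then last. Whitespace-split words stand in for tokenizer tokens, and the prompts and the `shared_prefix_len` helper are invented for the example.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading items the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


STABLE = "You are a coding agent . Respond with a JSON action .".split()

# Volatile field first: the two prompts diverge at position 0.
first_a = ["time=10:00"] + STABLE + ["fix", "the", "bug"]
first_b = ["time=10:01"] + STABLE + ["write", "tests"]

# Volatile field after the stable part: the whole system prompt is shared.
last_a = STABLE + ["time=10:00", "fix", "the", "bug"]
last_b = STABLE + ["time=10:01", "write", "tests"]

volatile_first = shared_prefix_len(first_a, first_b)
volatile_last = shared_prefix_len(last_a, last_b)
print(volatile_first, volatile_last, len(STABLE))  # 0 12 12
```

A prefix cache can only reuse what comes before the first difference, so timestamps, request IDs, and per-user fields belong after the shared instructions.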

When to Choose SGLang Over vLLM

Pick SGLang when:

  • Your agent sessions have high prefix overlap (same system prompt, incremental history)
  • You need low p95 latency for interactive agent use
  • Structured output (JSON tool calls) is a primary use case
  • You’re using the DSL for complex generation programs (branching, selection)
  • You want the same engine for serving and RL training

Pick vLLM when:

  • You’re optimizing for maximum throughput at large batch sizes
  • You need the largest community and widest model support
  • Your workload is mostly independent requests (low prefix overlap)
  • You want the most battle-tested option with the most deployment guides

Many production stacks use both — SGLang for the latency-sensitive agent-facing path, vLLM for batch workloads. A model gateway routes between them based on the request characteristics.
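
As a rough illustration of that routing decision, here is a sketch of a gateway rule that follows the two checklists above. Everything in it is hypothetical: the `RequestInfo` fields, the 0.5 threshold, and the backend labels are placeholders, and a real gateway would also weigh load, model availability, and cost.

```python
from dataclasses import dataclass


@dataclass
class RequestInfo:
    interactive: bool              # a user or agent loop is waiting on the answer
    shared_prefix_ratio: float     # estimated fraction of the prompt seen before
    needs_structured_output: bool  # e.g. JSON tool calls


def choose_backend(req: RequestInfo, prefix_threshold: float = 0.5) -> str:
    """Pick a serving pool for one request; thresholds are illustrative only."""
    if req.interactive or req.needs_structured_output:
        return "sglang"
    if req.shared_prefix_ratio >= prefix_threshold:
        return "sglang"
    return "vllm"


agent_turn = RequestInfo(interactive=True, shared_prefix_ratio=0.9,
                         needs_structured_output=True)
batch_job = RequestInfo(interactive=False, shared_prefix_ratio=0.05,
                        needs_structured_output=False)
print(choose_backend(agent_turn), choose_backend(batch_job))  # sglang vllm
```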

FAQ

Is SGLang production-ready?

Yes. It powers inference for organizations running 400,000+ GPUs collectively. NVIDIA includes it in their deep learning framework releases. It’s as production-ready as vLLM, just with a smaller (but rapidly growing) community.

Can I use SGLang without learning the DSL?

Absolutely. The OpenAI-compatible API works out of the box. You get RadixAttention benefits without touching the programming model. The DSL is there when you need finer control over structured generation.

Does SGLang support multi-model serving?

One SGLang instance serves one model (like vLLM). For multi-model setups, run multiple instances behind a router. The multi-LoRA batching feature lets you serve multiple fine-tuned variants of the same base model from one instance.

How does RadixAttention compare to vLLM’s prefix caching?

vLLM does block-level prefix caching — it reuses cache when a new request starts with the same token blocks as a previous one. SGLang’s radix tree is more flexible: it finds shared prefixes at any granularity and handles branching conversations where multiple requests diverge at different points. The advantage grows with conversation depth and retry patterns.

What’s the relationship with LMSYS?

SGLang is developed by the LMSYS team (the same people behind Chatbot Arena). It’s the serving engine they use to run arena evaluations at scale.

Key Takeaways

  • SGLang’s RadixAttention uses a radix tree to automatically find and reuse shared KV cache prefixes across requests — no manual configuration needed.
  • For agent workloads with high prefix overlap (shared system prompts, incremental conversations, retries), SGLang delivers measurably lower latency than block-level caching.
  • It’s also the RL rollout backbone for training frontier models, creating a feedback loop between serving optimization and model capability.
  • Use SGLang for latency-sensitive interactive agent sessions; pair it with vLLM for throughput-heavy batch work behind a routing layer.
