Agent Daily News

vLLM Explained: The Inference Engine Behind Agent Stacks


How vLLM works under the hood, why PagedAttention matters for agent workloads, and where it fits in a production agent infrastructure stack in 2026.

TL;DR — vLLM is the open-source inference engine that most production agent stacks run on top of. Its core trick, PagedAttention, borrows virtual-memory paging from operating systems to eliminate KV cache waste, delivering up to 24x higher throughput than naive HuggingFace serving. If your agents consume tokens at scale, vLLM is likely the layer turning GPU memory into usable output.

Why an Agent Builder Should Care About Inference Engines

Most agent tutorials wave at the inference layer and move on. You point your agent at an API endpoint, tokens come back, everyone’s happy. I was happy too — right up until I had 50 concurrent agent sessions each chewing through 32K-token contexts in tool-call loops, and watched p95 latency climb to 12 seconds while the GPU sat at 40% utilization. The model wasn’t slow. The serving was.

That’s the layer I want to talk about. Inference engines sit between your GPU and your agent runtime, deciding how to batch requests, manage memory, and schedule generation. vLLM is the one that defined the modern approach, and if you run agents at any real volume, it’s probably already in your stack whether you put it there or not.

What vLLM Actually Is

vLLM is a high-throughput LLM serving engine built at UC Berkeley’s Sky Computing Lab. Released in 2023, it introduced PagedAttention — an algorithm that treats KV cache memory the way an OS treats RAM: as pages that can be allocated, freed, and shared without contiguous allocation.

The numbers that matter in 2026:

Metric                 Value
GitHub stars           55K+
OpenRank (May 2026)    885.72
Active contributors    898
Current version        v0.7.x
Supported models       100+ architectures
License                Apache 2.0

vLLM is not a model. It’s not a framework. It’s the engine that makes models serve requests efficiently. Think of it as the database engine to your application — you don’t interact with it directly, but everything breaks if it’s slow.

The Core Idea: PagedAttention

Every LLM generates tokens by maintaining a KV (key-value) cache: a growing buffer that stores the attention keys and values of past tokens so the model doesn’t recompute them. The problem: traditional serving allocates a contiguous memory block for the maximum possible sequence length upfront. For a 70B model with grouped-query attention (Llama 3 class, FP16 cache), a 32K context window reserves roughly 10GB of GPU memory per request, even if the actual sequence only uses 2K tokens.

That’s like malloc’ing 32GB for a string that’s 200 bytes long.
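To make the waste concrete, here is a back-of-the-envelope sketch of KV cache size per request. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte elements) is my assumption for a Llama-3-style 70B model, not something vLLM reports; other architectures and cache dtypes give different numbers.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed Llama-3-style 70B shape with an FP16 cache (illustrative values).
PER_TOKEN = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)


def kv_gib(tokens: int) -> float:
    """KV cache size in GiB for a sequence of `tokens` tokens."""
    return tokens * PER_TOKEN / 2**30


if __name__ == "__main__":
    reserved = kv_gib(32 * 1024)   # what contiguous allocation sets aside
    used = kv_gib(2 * 1024)        # what a 2K-token request actually needs
    print(f"per token: {PER_TOKEN // 1024} KiB")
    print(f"reserved: {reserved:.2f} GiB, used: {used:.3f} GiB")
    print(f"unused share: {1 - used / reserved:.1%}")
```

Under these assumptions a 2K-token request touches about one sixteenth of what a 32K reservation holds back.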

PagedAttention fixes this by splitting KV cache into fixed-size pages (blocks). Pages are allocated on demand as tokens are generated, stored non-contiguously in GPU memory, and freed immediately when the request completes. The result:

Traditional:    [██████████░░░░░░░░░░░░░░░░░░░░░░]  ← ~70% reserved but never used
PagedAttention: [██][██][██][██][ ][ ][ ][ ][ ]  ← allocate as needed

This means vLLM can serve 3-5x more concurrent requests on the same GPU. For agent workloads — where you might have dozens of sessions running simultaneously, each with variable context lengths from tool calls — that’s the difference between needing 8 GPUs and needing 2.
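The allocation scheme described above can be sketched as a toy block allocator. This is not vLLM's code: the class and method names are hypothetical, and the block size of 16 tokens is an assumption. It only shows the bookkeeping idea of a per-request block table that grows one block at a time.

```python
class PagedKVAllocator:
    """Toy paged allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))      # physical block ids
        self.block_tables: dict[str, list[int]] = {}    # request -> its blocks
        self.num_tokens: dict[str, int] = {}            # request -> tokens stored

    def append_token(self, request_id: str) -> None:
        """Record one more token; take a new block only when the last one is full."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real engine would preempt")
            table.append(self.free_blocks.pop())        # blocks need not be adjacent
        self.num_tokens[request_id] = count + 1

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


if __name__ == "__main__":
    alloc = PagedKVAllocator(num_blocks=8)
    for _ in range(40):
        alloc.append_token("session-a")
    print(len(alloc.block_tables["session-a"]), "blocks in use")
```

A 40-token request ends up holding three 16-token blocks, so the most it can waste is the unfilled tail of its last block.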

Architecture in a Production Stack

Here’s where vLLM sits in a typical agent deployment:

┌─────────────────────────────────────────────┐
│  Agent Runtime (your code)                   │
│  Claude Code / OpenHands / custom agent      │
├─────────────────────────────────────────────┤
│  API Gateway / Router                        │
│  LiteLLM / SandBase / OpenRouter             │
├─────────────────────────────────────────────┤
│  Inference Engine                            │
│  vLLM  ←── you are here                     │
├─────────────────────────────────────────────┤
│  Hardware                                    │
│  H100 / A100 / L40S / AMD MI300X            │
└─────────────────────────────────────────────┘

vLLM exposes an OpenAI-compatible API out of the box. You can literally replace https://api.openai.com/v1 with http://your-vllm-server:8000/v1 and your agent code doesn’t change. That’s why it slots into existing stacks with near-zero friction.
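As a minimal sketch of what "only the base URL changes" means, here is a standard-library client for the chat completions route. The host name is the placeholder from the sentence above, and the helper names are mine; in a real agent you would more likely use an OpenAI SDK and change its base URL.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST to {base_url}/chat/completions; only base_url is server-specific."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (needs a reachable server)."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder host; nothing is sent here, the URL is only printed.
    req = build_chat_request("http://your-vllm-server:8000/v1",
                             "meta-llama/Llama-3.3-70B-Instruct", "Hello")
    print(req.full_url)
```

Pointing the same function at a different base URL is the whole migration, which is why the gateway layer can swap backends without touching agent code.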

Key Features That Matter for Agents

Not every vLLM feature matters equally for agent workloads. These do:

Continuous batching — Traditional batching waits until N requests arrive, then processes them together. Continuous batching dynamically adds new requests to an in-progress batch as GPU capacity frees up. For agents, this means your tool-call responses don’t wait in a queue behind someone’s 4K-token essay.
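A toy simulation makes the difference visible. It assumes all requests are already queued, every request needs at least one decode step, and each step produces one token per running request. Real schedulers also account for prefill and memory, which this sketch ignores.

```python
from collections import deque


def static_batching_finish_times(lengths: list[int], slots: int) -> list[int]:
    """Each batch of `slots` requests runs until its longest member finishes."""
    finish, clock = [], 0
    for start in range(0, len(lengths), slots):
        batch = lengths[start:start + slots]
        clock += max(batch)
        finish.extend([clock] * len(batch))   # the whole batch returns together
    return finish


def continuous_batching_finish_times(lengths: list[int], slots: int) -> list[int]:
    """A freed slot is refilled from the queue before the next decode step."""
    queue = deque(enumerate(lengths))
    running: dict[int, int] = {}              # request index -> tokens still to generate
    finish = [0] * len(lengths)
    clock = 0
    while queue or running:
        while queue and len(running) < slots:
            idx, remaining = queue.popleft()
            running[idx] = remaining
        clock += 1                            # one decode step for every running request
        for idx in list(running):
            running[idx] -= 1
            if running[idx] <= 0:
                finish[idx] = clock
                del running[idx]
    return finish


if __name__ == "__main__":
    # One long essay followed by three short tool-call replies, two slots.
    lengths = [100, 5, 5, 5]
    print("static:    ", static_batching_finish_times(lengths, slots=2))
    print("continuous:", continuous_batching_finish_times(lengths, slots=2))
```

With static batching the short replies finish at steps 100, 105 and 105; with continuous batching they finish at steps 5, 10 and 15 while the essay keeps generating in the other slot.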

Prefix caching — When multiple requests share the same system prompt (common in agent fleets where every session starts with the same instructions), vLLM caches that prefix’s KV state and reuses it. If your agent system prompt is 2K tokens and you’re serving 100 sessions, you compute those 2K tokens once instead of 100 times.
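One way to picture block-level prefix reuse is a chain of block hashes: each full block is hashed together with everything before it, so two prompts share cache entries exactly as far as their tokens agree. The sketch below is an illustration under that assumption (hypothetical names, block size 16), not vLLM's implementation.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumed value for the sketch)


def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash every full block together with the hash of the blocks before it."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        text = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(text.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Remembers which block hashes already have KV state computed."""

    def __init__(self) -> None:
        self.cached: set[str] = set()

    def prefill(self, token_ids: list[int]) -> int:
        """Return how many prompt tokens still have to be computed."""
        hashes = block_hashes(token_ids)
        hit_blocks = 0
        for h in hashes:
            if h not in self.cached:
                break                      # reuse stops at the first miss
            hit_blocks += 1
        self.cached.update(hashes)
        return len(token_ids) - hit_blocks * BLOCK_SIZE


if __name__ == "__main__":
    system_prompt = list(range(2048))      # stand-in for a 2K-token system prompt
    cache = PrefixCache()
    print(cache.prefill(system_prompt + [9001] * 10))   # first session
    print(cache.prefill(system_prompt + [9002] * 10))   # second session
```

The first session computes all 2058 prompt tokens; the second only computes its 10 new ones, because the 128 blocks covering the system prompt are already cached.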

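Tensor parallelism splits a model across multiple GPUs on one machine. A 70B parameter model at FP16 needs ~140GB of VRAM for the weights alone. Two H100s (80GB each) can hold it with TP=2, but that leaves only a thin slice of memory for KV cache, so long contexts or high concurrency call for TP=4 or a quantized (FP8) checkpoint. vLLM handles the split and communication automatically.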

Speculative decoding — Use a small draft model to predict multiple tokens, then verify them in parallel with the target model. In practice this gives 1.5-2x speedup on latency-sensitive agent interactions where the user is waiting for a response.
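Here is a sketch of one draft-and-verify round for the greedy case, where "verify" just means checking that the target model would have picked the same token. The two toy "models" are hypothetical functions from context to next token; a real engine scores all drafted positions in one batched forward pass rather than looping.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[int]], int]   # greedy "model": context -> next token id


def speculative_step(context: list[int], draft: NextToken, target: NextToken,
                     k: int = 4) -> list[int]:
    """One round: draft proposes k tokens, target keeps the agreeing prefix plus one token."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    accepted, ctx = [], list(context)
    for token in proposed:                   # stands in for one batched verify pass
        expected = target(ctx)
        if expected != token:
            accepted.append(expected)        # first disagreement: use the target's token, stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target(ctx))             # everything matched: one bonus token
    return accepted


def toy_target(ctx: Sequence[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[int]) -> int:
    """Toy draft model: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, k=4))
```

The output is always what the target model alone would have produced; the draft only changes how many tokens come out per verification round.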

Structured output matters because agents need JSON, not free-form text. vLLM supports guided generation with grammar or JSON-schema constraints, so tool-call responses parse correctly as long as generation is not cut off by max_tokens.

Running vLLM: The Minimal Setup

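# Pull and run with Docker (simplest path)
# Llama weights are gated on Hugging Face, so pass a token that has access.
# --ipc=host gives the tensor-parallel workers access to host shared memory.
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768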

# Test it
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100
  }'

That gives you a working inference endpoint in one command. Your agent code hits it like any OpenAI-compatible API.

Production Tuning: The Flags That Actually Matter

The Docker one-liner gets you started. Production gets you into trouble. Here are the settings that make the difference between “it works” and “it works under load”:

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --disable-log-requests

What each does and why:

  • --enable-prefix-caching: This is close to free performance for agent workloads. If 100 sessions share the same 2K-token system prompt, the KV cache for that prefix is computed once. Without it, you’re recomputing it for every request. We measured a 30-40% reduction in time to first token (TTFT) on agent fleets with shared instructions.

  • --max-num-seqs 128: Default is 256, which sounds fine until you look at how much KV cache that many long sequences need. The flag is a scheduler cap, not a reservation (blocks are still allocated on demand), but on a 70B model with 32K context, 256 concurrent sequences exhaust the cache and trigger preemptions long before they all fill up. 128 is a safer starting point; tune upward based on actual peak concurrency.

  • --gpu-memory-utilization 0.92: Default is 0.9. Pushing to 0.92 gives you ~1.6GB more usable KV cache per 80GB H100. The risk: less headroom for memory outside vLLM’s budget (CUDA context, NCCL buffers, other processes on the GPU), which shows up as out-of-memory errors at startup or under load. Monitor vllm:gpu_cache_usage_perc in the metrics.

  • --enable-chunked-prefill: Breaks long prefill requests into chunks that can be interleaved with decode operations. Without this, a single 32K-token prefill blocks the GPU for 2-3 seconds and every queued decode request waits. With it, decode latency stays bounded even when long prompts arrive.

  • --ipc=host: A Docker flag rather than a vLLM one, needed for multi-GPU tensor parallelism. It gives the container access to host shared memory, which the worker processes use to exchange data. Without it, NCCL inter-process communication can fail or degrade to slow paths.

  • --disable-log-requests: In production with 100+ req/s, logging every request costs 5-10% of throughput. Route request logs to your gateway instead. Request-logging flags may differ between releases, so confirm the name with vllm serve --help on your version.

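The gotcha nobody warns you about: --max-model-len defaults to whatever the model config says (often 128K for modern models). vLLM does not reserve that much per request, but it refuses to start if the KV cache cannot hold even one sequence of that length, and it lets any single runaway session grow to 128K tokens and crowd out dozens of shorter ones. If your actual max is 32K, a 128K limit lets one request claim four times the cache you planned for. Always set this to your actual maximum expected context length.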
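To see how these flags interact, here is a rough memory budget sketch. It uses round numbers (80GB per GPU, 140GB of FP16 weights) plus two assumptions of mine: a 320 KiB-per-token KV footprint for a Llama-3-style 70B model, and a guessed overhead for activations and CUDA graphs. Real GPUs expose slightly different usable memory, so treat the output as an order of magnitude.

```python
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2   # assumed: K+V, 80 layers, 8 KV heads, dim 128, FP16


def kv_pool_tokens(num_gpus: int, gpu_gb: float, utilization: float,
                   weights_gb: float, overhead_gb: float) -> int:
    """Tokens of KV cache that fit after weights and overhead are subtracted."""
    budget_gb = num_gpus * gpu_gb * utilization      # what --gpu-memory-utilization allows
    pool_gb = budget_gb - weights_gb - overhead_gb   # what is left for the KV cache
    return max(0, int(pool_gb * 1e9 / KV_BYTES_PER_TOKEN))


if __name__ == "__main__":
    max_len = 32768
    # overhead_gb values are guesses; adjust them for your own measurements.
    for gpus, overhead in ((2, 4.0), (4, 8.0)):
        tokens = kv_pool_tokens(gpus, gpu_gb=80, utilization=0.92,
                                weights_gb=140, overhead_gb=overhead)
        print(f"TP={gpus}: ~{tokens:,} cacheable tokens, "
              f"{tokens // max_len} full {max_len}-token sequences")
```

Under these assumptions the TP=2 FP16 setup has a KV pool on the order of a single long sequence, while TP=4 has room for roughly a dozen. Measure the real pool size from vLLM's startup logs before trusting either number.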

Monitoring: What to Watch

vLLM exposes Prometheus metrics at /metrics. The ones that matter for agent workloads:

Metric                                      What it tells you                 Alert threshold
vllm:num_requests_waiting                   Queue depth                       > 50 sustained = add capacity
vllm:gpu_cache_usage_perc                   KV cache pressure                 > 0.95 = requests will queue or be preempted
vllm:avg_generation_throughput_toks_per_s   Output speed                      Sudden drops = problem
vllm:e2e_request_latency_seconds            Total latency per request         p99 > 10s for agents = bad UX
vllm:num_preemptions_total                  Requests evicted mid-generation   Any non-zero = capacity issue

Preemptions are the silent killer. When KV cache fills up, vLLM preempts some in-progress requests so the others can keep generating. By default a preempted request loses its KV cache and has to recompute it when it resumes, so you pay for that work twice. If you see preemptions, either reduce --max-num-seqs or add GPUs.
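A minimal sketch of checking those thresholds against a scrape of /metrics, using only the standard library. The sample text and its model_name label are made up for illustration. A real setup would use Prometheus alert rules, look at rates rather than raw counter values, and require the queue condition to be sustained.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text format into {metric_name: value}, summing over label sets."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):        # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]     # drop the {label="..."} part
        try:
            values[name] = values.get(name, 0.0) + float(raw_value)
        except ValueError:
            continue                                # ignore lines that do not end in a number
    return values


def check_alerts(metrics: dict[str, float]) -> list[str]:
    """Apply the alert thresholds from the table above to one scrape."""
    alerts = []
    if metrics.get("vllm:num_requests_waiting", 0.0) > 50:
        alerts.append("queue depth above 50")
    if metrics.get("vllm:gpu_cache_usage_perc", 0.0) > 0.95:
        alerts.append("KV cache above 95%")
    if metrics.get("vllm:num_preemptions_total", 0.0) > 0:
        alerts.append("preemptions observed")
    return alerts


SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 62.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.91
vllm:num_preemptions_total{model_name="demo"} 3.0
"""

if __name__ == "__main__":
    print(check_alerts(parse_metrics(SAMPLE)))
```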

vLLM vs SGLang: The Real Trade-off

SGLang is vLLM’s main competitor, and the benchmarks are close enough that the choice depends on your workload:

Dimension                 vLLM                                   SGLang
Throughput (high batch)   Slightly higher at large batches       Slightly lower
Latency p95 (streaming)   ~85ms                                  ~48ms
Prefix reuse strategy     Block-level prefix caching             RadixAttention (radix tree)
Structured generation     Grammar-guided                         LMFE + jump-forward
Community size            Larger (55K stars, 898 contributors)   Smaller but growing fast (415 contributors)
Best fit                  High-throughput serving, large fleet   Low-latency, complex prompt reuse patterns

For agent workloads specifically, SGLang’s RadixAttention can be more efficient when you have branching prompt structures — like an agent that frequently retries with modified prompts sharing a long common prefix. vLLM’s prefix caching handles simpler cases well but doesn’t do the radix-tree deduplication that SGLang exploits.

The benchmark numbers above need context, because raw comparisons mislead. The ~85ms vs ~48ms p95 figure comes from streaming with small batch sizes (under 8 concurrent) on Llama 3.3 70B at FP8 on a single H100. Flip to large-batch throughput (batch 128+) and vLLM pulls ahead by 5-10% — the continuous batching scheduler is more mature at saturation. The honest read: at low concurrency SGLang wins latency, at high concurrency vLLM wins throughput, and in the middle they’re within noise of each other. Anyone claiming one is universally “2x faster” is quoting a benchmark tuned to their conclusion.

In practice, many production setups use both: vLLM for high-throughput batch workloads (embedding, summarization), SGLang for latency-sensitive interactive agent sessions. A gateway like LiteLLM or SandBase routes between them.

Where vLLM Struggles

It’s not all wins. Real pain points in 2026:

Cold start: Loading a 70B model takes 60-90 seconds. If your infrastructure scales to zero, that’s unacceptable for interactive agents. You need warm instances or a pre-loading strategy.

KV cache scaling under agentic workloads: The SGLang team’s roadmap (issue #21846) explicitly called out that agentic workloads are driving KV cache storage volumes beyond what current prefill/decode (PD) disaggregation handles well. Long-running agent sessions that accumulate history stress the cache system differently than stateless chat requests.

Cascading failures at scale: A real incident documented in SGLang issue #20252 showed a 90-prefill + 30-decode cluster on H20 GPUs hitting cascading failure: node restarts caused health-check failures, the router removed workers, traffic concentrated on the remaining nodes, and the whole cluster 503’d under load. vLLM faces the same class of problems. Industrial-scale serving requires careful router, health-check, and capacity planning beyond what the engine itself provides.

Multi-node overhead: Splitting a model across machines (typically pipeline parallelism between nodes, with tensor parallelism inside each node) introduces network latency between stages. For models that fit on one node, stay single-node. Cross-node only makes sense for truly massive models (400B+).

How It Fits Your Agent Infrastructure

If you’re building agents that call self-hosted models — whether for cost, privacy, or latency reasons — vLLM is the default starting point. The decision tree:

  1. Using only cloud APIs (OpenAI, Anthropic, Google)? → You don’t need vLLM directly; the provider runs its own serving stack. Many hosted open-model providers do build on vLLM.
  2. Self-hosting one model for your agent fleet? → vLLM with Docker, fronted by a gateway for auth/routing.
  3. Running multiple models (routing between cheap/strong)? → vLLM instances behind a router that picks the right model per request based on task complexity.
  4. Need sub-50ms latency? → Consider SGLang for the latency-critical path, vLLM for throughput.

The gateway layer (LiteLLM, SandBase) becomes critical at option 3 and beyond. You don’t want your agent code knowing which physical endpoint serves which model — that’s infrastructure’s job.
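The routing idea in the decision tree can be sketched as a small lookup that agent code never sees. The backend names, URLs, model names, and the two-field request description are all hypothetical; real gateways such as LiteLLM express this in their own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str
    model: str


# Hypothetical endpoints and model names, for illustration only.
BACKENDS = {
    "cheap": Backend("vllm-small", "http://vllm-small.internal:8000/v1", "cheap-model"),
    "interactive": Backend("sglang-latency", "http://sglang.internal:30000/v1", "strong-model"),
    "batch": Backend("vllm-throughput", "http://vllm.internal:8000/v1", "strong-model"),
}


def route(complexity: str, interactive: bool) -> Backend:
    """Pick a backend from task complexity and whether a user is waiting."""
    if complexity == "simple":
        return BACKENDS["cheap"]         # cheap model is enough for easy tasks
    if interactive:
        return BACKENDS["interactive"]   # latency-critical path
    return BACKENDS["batch"]             # everything else goes for throughput


if __name__ == "__main__":
    for args in (("simple", True), ("hard", True), ("hard", False)):
        print(args, "->", route(*args).name)
```

Keeping this table in the gateway means an engine swap or a new GPU pool is a config change, not an agent release.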

FAQ

Can I run vLLM on consumer GPUs?

Yes, with smaller models. A 4090 (24GB) can serve quantized 7-13B models comfortably. For 70B+, you need datacenter GPUs or multi-GPU setups.

Does vLLM support AMD GPUs?

Yes, ROCm support is available for MI200/MI300 series. Performance is competitive with NVIDIA on recent hardware.

How does vLLM compare to llama.cpp?

Different use cases. llama.cpp is optimized for CPU/edge/local inference with aggressive quantization. vLLM is GPU-first, optimized for serving many concurrent users at datacenter scale. An agent that runs locally on a laptop uses llama.cpp; an agent service handling 1000 users uses vLLM.

Is vLLM production-ready?

It’s running in production at dozens of companies including inference providers and model labs. LMSYS Chatbot Arena uses it. It’s as production-ready as open-source inference gets — but like any infrastructure, you still need monitoring, health checks, and capacity planning around it.

What’s the relationship between vLLM and NVIDIA Dynamo?

Dynamo is NVIDIA’s orchestration layer that sits above engines like vLLM. It handles multi-model scheduling, KV cache routing, and cluster-level scaling. Think of vLLM as the engine inside one node; Dynamo as the fleet manager across nodes.

Key Takeaways

  • vLLM is the inference engine most production LLM deployments build on. PagedAttention eliminates memory waste in KV cache, enabling 3-5x more concurrent requests on the same GPU.
  • For agent workloads, the features that matter most are continuous batching (no queue delays for tool calls), prefix caching (shared system prompts), and structured output (reliable JSON).
  • The real competition is SGLang, which wins on latency and complex prefix reuse. Many production stacks run both behind a router.
  • vLLM is infrastructure: it sits below your agent runtime and API gateway. If you’re self-hosting models for agents, it’s your starting point.
