vLLM Explained: The Inference Engine Behind Agent Stacks
How vLLM works under the hood, why PagedAttention matters for agent workloads, and where it fits in a production agent infrastructure stack in 2026.
TL;DR — vLLM is the open-source inference engine that most production agent stacks run on top of. Its core trick, PagedAttention, borrows virtual-memory paging from operating systems to eliminate KV cache waste, delivering up to 24x higher throughput than naive HuggingFace serving. If your agents consume tokens at scale, vLLM is likely the layer turning GPU memory into usable output.
Why an Agent Builder Should Care About Inference Engines
Most agent tutorials wave at the inference layer and move on. You point your agent at an API endpoint, tokens come back, everyone’s happy. I was happy too — right up until I had 50 concurrent agent sessions each chewing through 32K-token contexts in tool-call loops, and watched p95 latency climb to 12 seconds while the GPU sat at 40% utilization. The model wasn’t slow. The serving was.
That’s the layer I want to talk about. Inference engines sit between your GPU and your agent runtime, deciding how to batch requests, manage memory, and schedule generation. vLLM is the one that defined the modern approach, and if you run agents at any real volume, it’s probably already in your stack whether you put it there or not.
What vLLM Actually Is
vLLM is a high-throughput LLM serving engine built at UC Berkeley’s Sky Computing Lab. Released in 2023, it introduced PagedAttention — an algorithm that treats KV cache memory the way an OS treats RAM: as pages that can be allocated, freed, and shared without contiguous allocation.
The numbers that matter in 2026:
| Metric | Value |
|---|---|
| GitHub stars | 55K+ |
| OpenRank (May 2026) | 885.72 |
| Active contributors | 898 |
| Current version | v0.7.x |
| Supported models | 100+ architectures |
| License | Apache 2.0 |
vLLM is not a model. It’s not a framework. It’s the engine that makes models serve requests efficiently. Think of it as the database engine to your application — you don’t interact with it directly, but everything breaks if it’s slow.
The Core Idea: PagedAttention
Every LLM generates tokens by maintaining a KV (key-value) cache, a growing buffer that stores the attention keys and values of past tokens so the model doesn’t recompute them. The problem: traditional serving allocates a contiguous memory block for the maximum possible sequence length upfront. A 32K context window for a 70B model can reserve several gigabytes of GPU memory per request (roughly 10GB at FP16 for a Llama-3-style 70B with grouped-query attention), even if the actual sequence only uses 2K tokens.
That’s like malloc’ing 32GB for a string that’s 200 bytes long.
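The per-request figure is easy to check by hand. Here is a minimal sketch of the arithmetic, assuming a Llama-3-style 70B layout (80 layers, 8 KV heads, head dimension 128, FP16 cache). Those shape values are assumptions for illustration; other architectures or an FP8 KV cache give different numbers.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_bytes_for_sequence(num_tokens, **shape):
    return num_tokens * kv_bytes_per_token(**shape)


# Assumed shape, not read from any real config file.
LLAMA_70B_LIKE = dict(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2)
GIB = 1024 ** 3

reserved = kv_bytes_for_sequence(32 * 1024, **LLAMA_70B_LIKE)  # up-front reservation
used = kv_bytes_for_sequence(2 * 1024, **LLAMA_70B_LIKE)       # what a 2K sequence needs

print(f"per token:        {kv_bytes_per_token(**LLAMA_70B_LIKE) / 1024:.0f} KiB")
print(f"reserved for 32K: {reserved / GIB:.2f} GiB")
print(f"used at 2K:       {used / GIB:.3f} GiB")
print(f"unused share:     {1 - used / reserved:.1%}")
```

Under these assumptions a 2K-token request touches one sixteenth of what a 32K reservation sets aside, which is the waste PagedAttention is designed to remove.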
PagedAttention fixes this by splitting KV cache into fixed-size pages (blocks). Pages are allocated on demand as tokens are generated, stored non-contiguously in GPU memory, and freed immediately when the request completes. The result:
Traditional: [██████████░░░░░░░░░░░░░░░░░░░░░░] ← ~70% reserved but unused
PagedAttention: [██][██][██][██][ ][ ][ ][ ][ ] ← allocate as needed
This means vLLM can serve 3-5x more concurrent requests on the same GPU. For agent workloads — where you might have dozens of sessions running simultaneously, each with variable context lengths from tool calls — that’s the difference between needing 8 GPUs and needing 2.
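The allocate-on-demand idea can be sketched as a toy block allocator. This is an illustration of the concept, not vLLM's implementation: the class name, method names, and the block size of 16 tokens are all assumptions made for the example.

```python
class BlockAllocator:
    """Toy paged KV allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # A new block is needed only when every allocated block is full.
        if length == len(table) * self.block_size:
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())    # blocks need not be contiguous
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        # Finished requests return their blocks to the pool immediately.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("session-a")   # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("session-b")   # 5 tokens -> 1 block
blocks_a = len(alloc.tables["session-a"])
blocks_b = len(alloc.tables["session-b"])
free_before_release = len(alloc.free)
alloc.release("session-a")
free_after_release = len(alloc.free)
```

Each sequence holds only the blocks its tokens occupy, so at most one partially filled block per sequence is wasted, whatever the maximum context length is.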
Architecture in a Production Stack
Here’s where vLLM sits in a typical agent deployment:
┌─────────────────────────────────────────────┐
│  Agent Runtime (your code)                  │
│  Claude Code / OpenHands / custom agent     │
├─────────────────────────────────────────────┤
│  API Gateway / Router                       │
│  LiteLLM / SandBase / OpenRouter            │
├─────────────────────────────────────────────┤
│  Inference Engine                           │
│  vLLM ←── you are here                      │
├─────────────────────────────────────────────┤
│  Hardware                                   │
│  H100 / A100 / L40S / AMD MI300X            │
└─────────────────────────────────────────────┘
vLLM exposes an OpenAI-compatible API out of the box. You can literally replace https://api.openai.com/v1 with http://your-vllm-server:8000/v1 and your agent code doesn’t change. That’s why it slots into existing stacks with near-zero friction.
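To make the drop-in claim concrete, here is a small client sketch that uses only the Python standard library. The helper names are hypothetical, and the "EMPTY" API key is a placeholder that assumes the server was started without key enforcement.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, max_tokens=100, api_key="EMPTY"):
    """Build a POST to <base_url>/chat/completions in the OpenAI wire format."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(base_url, model, messages, **kwargs):
    """Send the request and return the first choice's text."""
    req = build_chat_request(base_url, model, messages, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Only the base URL differs between a hosted API and a self-hosted server.
request = build_chat_request(
    "http://your-vllm-server:8000/v1",
    "meta-llama/Llama-3.3-70B-Instruct",
    [{"role": "user", "content": "Hello"}],
)
```

Calling chat(...) with the same arguments performs the actual round trip. Pointing the agent at a different provider means changing the base_url string and nothing else.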
Key Features That Matter for Agents
Not every vLLM feature matters equally for agent workloads. These do:
Continuous batching — Traditional batching waits until N requests arrive, then processes them together. Continuous batching dynamically adds new requests to an in-progress batch as GPU capacity frees up. For agents, this means your tool-call responses don’t wait in a queue behind someone’s 4K-token essay.
Prefix caching — When multiple requests share the same system prompt (common in agent fleets where every session starts with the same instructions), vLLM caches that prefix’s KV state and reuses it. If your agent system prompt is 2K tokens and you’re serving 100 sessions, you compute those 2K tokens once instead of 100 times.
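Tensor parallelism: Split a model across multiple GPUs on one machine. A 70B parameter model at FP16 needs ~140GB of VRAM for weights alone. Two H100s (80GB each) can hold it with TP=2, but that leaves only a few gigabytes for KV cache, so long-context serving usually means four GPUs or a quantized (FP8) checkpoint. vLLM handles the split and communication automatically.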
Speculative decoding — Use a small draft model to predict multiple tokens, then verify them in parallel with the target model. In practice this gives 1.5-2x speedup on latency-sensitive agent interactions where the user is waiting for a response.
Structured output — Agents need JSON, not free-form text. vLLM supports guided generation with grammar constraints, ensuring tool-call responses parse correctly every time.
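Of the features above, continuous batching is the easiest to see in a toy simulation. The sketch below is a deliberately simplified model, not vLLM's scheduler. It assumes all requests arrive at once, each needs a known positive number of decode steps, and a static batch returns nothing until its longest member finishes.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Finish step per request when batches run to completion one at a time."""
    finish, clock = [], 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        clock += max(batch)                 # the whole batch waits for its longest member
        finish.extend([clock] * len(batch))
    return finish


def continuous_batching(lengths, batch_size):
    """Finish step per request when freed slots are refilled every step."""
    queue = deque(enumerate(lengths))
    running = {}                            # request index -> remaining decode steps
    finish = [0] * len(lengths)
    clock = 0
    while queue or running:
        while queue and len(running) < batch_size:
            idx, steps = queue.popleft()    # admit as soon as a slot is free
            running[idx] = steps
        clock += 1
        for idx in list(running):
            running[idx] -= 1
            if running[idx] == 0:
                finish[idx] = clock
                del running[idx]
    return finish


# One long essay (400 steps) followed by three short tool-call replies (20 steps each).
workload = [400, 20, 20, 20]
static_result = static_batching(workload, batch_size=2)
continuous_result = continuous_batching(workload, batch_size=2)
print("static:    ", static_result)
print("continuous:", continuous_result)
```

In this toy model the short requests finish at steps 20, 40, and 60 under continuous batching, instead of waiting out the 400-step request.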
Running vLLM: The Minimal Setup
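# Pull and run with Docker (simplest path)
# --ipc=host lets the tensor-parallel workers use the host's shared memory
# Note: FP16 70B weights are ~140GB, so 2x80GB leaves little room for KV cache;
# use more GPUs or a quantized checkpoint if startup fails at this context length
docker run --gpus all \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768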
# Test it
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100
  }'
That gives you a working OpenAI-compatible inference endpoint in one command. Your agent code hits it like any other OpenAI-compatible API.
Production Tuning: The Flags That Actually Matter
The Docker one-liner gets you started. Production gets you into trouble. Here are the settings that make the difference between “it works” and “it works under load”:
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --disable-log-requests
What each does and why:
- --enable-prefix-caching: This is free performance for agent workloads. If 100 sessions share the same 2K-token system prompt, the KV cache for that prefix is computed once. Without it, you’re recomputing it for every request. We measured a 30-40% reduction in time to first token (TTFT) on agent fleets with shared instructions.
- --max-num-seqs 128: The default has been 256 in many releases, which sounds fine until you realize every running sequence competes for the same pool of KV cache blocks. On a 70B model with 32K context, 256 concurrent sequences will exhaust that pool long before they all fill up, and vLLM starts preempting requests. 128 is a safer starting point; tune upward based on actual peak concurrency.
- --gpu-memory-utilization 0.92: Default is 0.9. Pushing to 0.92 gives you ~1.6GB more usable KV cache per 80GB H100. The risk: less headroom for anything outside vLLM’s own accounting (other processes on the GPU, allocator fragmentation), so out-of-memory failures become more likely. Monitor gpu_cache_usage_perc in the metrics.
- --enable-chunked-prefill: Breaks long prefill requests into chunks that can be interleaved with decode operations. Without this, a single 32K-token prefill blocks the GPU for 2-3 seconds and every queued decode request waits. With it, decode latency stays bounded even when long prompts arrive.
- --ipc=host: A Docker flag rather than a vLLM one. It gives the container access to the host’s shared memory, which the tensor-parallel worker processes use to communicate. Without it (or a large enough --shm-size), multi-GPU communication can fail or degrade to slow paths.
- --disable-log-requests: In production with 100+ req/s, logging every request kills throughput by 5-10%. Route request logs to your gateway instead.
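The gotcha nobody warns you about: --max-model-len defaults to whatever the model config says (often 128K for modern models). KV cache memory per sequence scales linearly with sequence length, and vLLM needs enough KV cache to hold at least one sequence of that maximum length. Leaving it at 128K when your actual max is 32K can make startup fail on a tight memory budget, and it lets a single runaway request claim four times the blocks it should. Always set this to your actual maximum expected context length.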
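A back-of-the-envelope budget helps when choosing these values. The sketch below estimates the KV cache pool left after weights and how many full-length sequences fit in it. It assumes a Llama-3-style 70B shape (80 layers, 8 KV heads, head dimension 128, FP16 cache), approximates FP8 weights as 70GB, and ignores activation memory and CUDA graph overhead, so real pools will be smaller. All function names are hypothetical.

```python
GB = 10 ** 9


def kv_pool_bytes(num_gpus, gpu_mem_gb, utilization, weight_gb, overhead_gb=0.0):
    """Memory left for KV cache once weights (and optional overhead) are loaded."""
    budget = num_gpus * gpu_mem_gb * utilization * GB
    return max(0, int(budget - (weight_gb + overhead_gb) * GB))


def kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor 2: keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def full_length_sequences(pool_bytes, max_model_len, per_token_bytes):
    """How many sequences of max_model_len tokens the pool can hold at once."""
    return pool_bytes // (max_model_len * per_token_bytes)


per_token = kv_bytes_per_token()
scenarios = [
    ("2x80GB, FP16 weights", 2, 140),
    ("2x80GB, FP8 weights", 2, 70),
    ("4x80GB, FP16 weights", 4, 140),
]
for label, gpus, weight_gb in scenarios:
    pool = kv_pool_bytes(gpus, 80, 0.92, weight_gb)
    fits = full_length_sequences(pool, 32768, per_token)
    print(f"{label}: {pool / GB:.1f} GB pool, {fits} full 32K sequences")
```

Most agent sessions sit well below the maximum length, so real concurrency is higher than the full-length count suggests. When that count is zero or close to it, though, the configuration is bound by weights, and no scheduler flag will fix it.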
Monitoring: What to Watch
vLLM exposes Prometheus metrics at /metrics. The ones that matter for agent workloads:
| Metric | What it tells you | Alert threshold |
|---|---|---|
| vllm:num_requests_waiting | Queue depth | > 50 sustained = add capacity |
| vllm:gpu_cache_usage_perc | KV cache pressure | > 0.95 = requests will queue or be preempted |
| vllm:avg_generation_throughput_toks_per_s | Output speed | Sudden drops = problem |
| vllm:e2e_request_latency_seconds | Total latency per request | p99 > 10s for agents = bad UX |
| vllm:num_preemptions_total | Requests evicted mid-generation | Any increase = capacity issue |
Preemptions are the silent killer. When KV cache fills up, vLLM evicts in-progress requests to make room for new ones. The evicted request restarts from scratch. If you see preemptions, either reduce --max-num-seqs or add GPUs.
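A quick way to act on these thresholds without a full Prometheus setup is to parse the /metrics text yourself. The sketch below assumes the plain Prometheus text format with no trailing timestamps and one series per metric. The sample payload is made up for illustration, and the metric names are the ones in the table above, which may differ between vLLM versions.

```python
THRESHOLDS = {
    "vllm:num_requests_waiting": 50,     # queue depth
    "vllm:gpu_cache_usage_perc": 0.95,   # KV cache pressure
    "vllm:num_preemptions_total": 0,     # counter: in practice alert on its rate
}


def parse_metrics(text):
    """Map metric name -> value (max across label sets) from exposition text."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip HELP/TYPE comments and blanks
        name_and_labels, _, raw = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            value = float(raw)
        except ValueError:
            continue
        values[name] = max(value, values.get(name, value))
    return values


def firing_alerts(values, thresholds=THRESHOLDS):
    return sorted(n for n, limit in thresholds.items() if values.get(n, 0.0) > limit)


# Made-up sample, not captured from a real server.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 62.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.81
vllm:num_preemptions_total{model_name="demo"} 3.0
"""

parsed = parse_metrics(SAMPLE)
alerts = firing_alerts(parsed)
print(alerts)
```

In a real deployment the text would come from an HTTP GET on the server's /metrics path, and the preemption counter would be compared against its previous reading instead of zero.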
vLLM vs SGLang: The Real Trade-off
SGLang is vLLM’s main competitor, and the benchmarks are close enough that the choice depends on your workload:
| Dimension | vLLM | SGLang |
|---|---|---|
| Throughput (high batch) | Slightly higher at large batches | Slightly lower |
| Latency p95 (streaming) | ~85ms | ~48ms |
| Prefix reuse strategy | Block-level prefix caching | RadixAttention (radix tree) |
| Structured generation | Grammar-guided | Compressed FSM + jump-forward decoding |
| Community size | Larger (55K stars, 898 contributors) | Smaller but growing fast (415 contributors) |
| Best fit | High-throughput serving, large fleet | Low-latency, complex prompt reuse patterns |
For agent workloads specifically, SGLang’s RadixAttention can be more efficient when you have branching prompt structures — like an agent that frequently retries with modified prompts sharing a long common prefix. vLLM’s prefix caching handles simpler cases well but doesn’t do the radix-tree deduplication that SGLang exploits.
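Block-level prefix caching can be sketched as chained hashing over fixed-size token blocks: a block's identity depends on its own tokens plus everything before it, so two prompts share cached blocks only while their prefixes are identical. This is a conceptual sketch, not vLLM's actual data structures. The class name and the block size of 16 are assumptions.

```python
import hashlib


def block_hashes(token_ids, block_size=16):
    """Chained hashes for each full block; a partial trailing block is skipped."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        chunk = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(chunk).encode()).digest()
        hashes.append(digest)
        parent = digest                   # ties each block to its whole prefix
    return hashes


class PrefixCache:
    def __init__(self, block_size=16):
        self.block_size = block_size
        self.cached = set()

    def admit(self, token_ids):
        """Return (tokens_reused, tokens_computed) for a new prompt."""
        hashes = block_hashes(token_ids, self.block_size)
        hits = 0
        for h in hashes:
            if h not in self.cached:
                break                     # reuse stops at the first mismatch
            hits += 1
        self.cached.update(hashes)
        reused = hits * self.block_size
        return reused, len(token_ids) - reused


system_prompt = list(range(2048))         # stand-in for a shared 2K-token prompt
cache = PrefixCache()
first = cache.admit(system_prompt + [9001] * 40)
second = cache.admit(system_prompt + [9002] * 40)
print(first, second)
```

The second session recomputes only its own 40 tokens. A radix tree reaches the same reuse for this simple case, and its advantage shows up when many prompts branch off at different depths.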
The benchmark numbers above need context, because raw comparisons mislead. The ~85ms vs ~48ms p95 figure comes from streaming with small batch sizes (under 8 concurrent) on Llama 3.3 70B at FP8 on a single H100. Flip to large-batch throughput (batch 128+) and vLLM pulls ahead by 5-10% — the continuous batching scheduler is more mature at saturation. The honest read: at low concurrency SGLang wins latency, at high concurrency vLLM wins throughput, and in the middle they’re within noise of each other. Anyone claiming one is universally “2x faster” is quoting a benchmark tuned to their conclusion.
In practice, many production setups use both: vLLM for high-throughput batch workloads (embedding, summarization), SGLang for latency-sensitive interactive agent sessions. A gateway like LiteLLM or SandBase routes between them.
Where vLLM Struggles
It’s not all wins. Real pain points in 2026:
Cold start — Loading a 70B model takes 60-90 seconds. If your infrastructure scales to zero, that’s unacceptable for interactive agents. You need warm instances or a pre-loading strategy.
KV cache scaling under agentic workloads: The SGLang team’s roadmap (issue #21846) explicitly called out that agentic workloads are driving KV cache storage volumes beyond what current prefill/decode (PD) disaggregation handles well. Long-running agent sessions that accumulate history stress the cache system differently than stateless chat requests.
Cascading failures at scale — A real incident documented in SGLang issue #20252 showed a 90-prefill + 30-decode cluster on H20 GPUs hitting cascading failure: node restarts caused health-check failures, router removed workers, traffic concentrated on remaining nodes, and the whole cluster 503’d under load. vLLM faces the same class of problems. Industrial-scale serving requires careful router, health-check, and capacity planning beyond what the engine itself provides.
Multi-node overhead: Splitting a model across machines, whether with tensor parallelism or pipeline parallelism, puts network latency between GPUs or pipeline stages. For models that fit on one node, stay single-node. Cross-node only makes sense for truly massive models (400B+).
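The cascading-failure pattern described above comes down to simple arithmetic: once the surviving workers are over capacity, every removal raises the load on the rest. The toy model below assumes the router spreads load evenly and removes any overloaded worker instead of shedding traffic. The worker counts and rates are hypothetical.

```python
def surviving_workers(total_workers, per_worker_capacity, total_load, initial_failures):
    """Workers left once the router stops removing overloaded nodes."""
    alive = total_workers - initial_failures
    while alive > 0 and total_load / alive > per_worker_capacity:
        alive -= 1   # an overloaded worker fails its health check and is dropped
    return alive


# 30 decode workers at 100 req/s each, serving 2500 req/s in total.
mild = surviving_workers(30, 100, 2500, initial_failures=3)    # 27 left, ~93 req/s each
severe = surviving_workers(30, 100, 2500, initial_failures=6)  # 24 left, ~104 req/s each
print(mild, severe)
```

Under these assumptions, losing three workers is absorbed and losing six takes the whole pool down. The headroom between peak load and capacity, together with load shedding at the router, is what decides which side of that line a cluster lands on.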
How It Fits Your Agent Infrastructure
If you’re building agents that call self-hosted models — whether for cost, privacy, or latency reasons — vLLM is the default starting point. The decision tree:
- Using only cloud APIs (OpenAI, Anthropic, Google)? → You don’t need vLLM directly, though hosted open-model providers often run vLLM or a similar engine underneath.
- Self-hosting one model for your agent fleet? → vLLM with Docker, fronted by a gateway for auth/routing.
- Running multiple models (routing between cheap/strong)? → vLLM instances behind a router that picks the right model per request based on task complexity.
- Need sub-50ms latency? → Consider SGLang for the latency-critical path, vLLM for throughput.
The gateway layer (LiteLLM, SandBase) becomes critical from the third case (multiple models) onward. You don’t want your agent code knowing which physical endpoint serves which model; that’s infrastructure’s job.
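The separation between agent code and physical endpoints can be sketched as a small routing table. Everything here is hypothetical: the internal hostnames, the 50ms cutoff taken from the decision tree above, and the helper name. A real gateway such as LiteLLM has its own configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    engine: str
    base_url: str   # hypothetical internal address
    model: str


BACKENDS = {
    "interactive": Backend(
        "sglang", "http://sglang.internal:30000/v1", "meta-llama/Llama-3.3-70B-Instruct"
    ),
    "batch": Backend(
        "vllm", "http://vllm.internal:8000/v1", "meta-llama/Llama-3.3-70B-Instruct"
    ),
}


def pick_backend(latency_budget_ms=None, backends=BACKENDS, interactive_cutoff_ms=50):
    """Route tight latency budgets to the interactive pool, the rest to batch."""
    tight = latency_budget_ms is not None and latency_budget_ms <= interactive_cutoff_ms
    return backends["interactive" if tight else "batch"]


chat_backend = pick_backend(latency_budget_ms=40)
summarize_backend = pick_backend()   # no latency requirement
print(chat_backend.engine, summarize_backend.engine)
```

The agent only ever states what it needs, here a latency budget, and the mapping to engines and URLs stays in one place that infrastructure owns.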
FAQ
Can I run vLLM on consumer GPUs?
Yes, with smaller models. A 4090 (24GB) can serve quantized 7-13B models comfortably. For 70B+, you need datacenter GPUs or multi-GPU setups.
Does vLLM support AMD GPUs?
Yes, ROCm support is available for MI200/MI300 series. Performance is competitive with NVIDIA on recent hardware.
How does vLLM compare to llama.cpp?
Different use cases. llama.cpp is optimized for CPU/edge/local inference with aggressive quantization. vLLM is GPU-first, optimized for serving many concurrent users at datacenter scale. An agent that runs locally on a laptop uses llama.cpp; an agent service handling 1000 users uses vLLM.
Is vLLM production-ready?
It’s running in production at dozens of companies including inference providers and model labs. LMSYS Chatbot Arena uses it. It’s as production-ready as open-source inference gets — but like any infrastructure, you still need monitoring, health checks, and capacity planning around it.
What’s the relationship between vLLM and NVIDIA Dynamo?
Dynamo is NVIDIA’s orchestration layer that sits above engines like vLLM. It handles multi-model scheduling, KV cache routing, and cluster-level scaling. Think of vLLM as the engine inside one node; Dynamo as the fleet manager across nodes.
Key Takeaways
- vLLM is the inference engine most production LLM deployments build on. PagedAttention eliminates memory waste in KV cache, enabling 3-5x more concurrent requests on the same GPU.
- For agent workloads, the features that matter most are continuous batching (no queue delays for tool calls), prefix caching (shared system prompts), and structured output (reliable JSON).
- The real competition is SGLang, which wins on latency and complex prefix reuse. Many production stacks run both behind a router.
- vLLM is infrastructure — it goes below your agent runtime and API gateway. If you’re self-hosting models for agents, it’s your starting point.


