Frequently Asked Questions

Practical questions a senior engineer would ask when evaluating or operating vLLM.

GPU utilization is high but throughput is lower than expected -- what's the diagnostic playbook?
High GPU utilization doesn't guarantee high throughput. Check these in order: (1) KV cache contention -- monitor gpu_cache_usage_perc in Prometheus. If at 100%, requests are queueing for memory. Reduce max-model-len or add GPUs. (2) Excessive preemption -- check num_preemptions. Each preemption wastes all KV cache computation for that request. (3) Prefill bottleneck -- long prompts tie up the GPU. Enable chunked prefill with --enable-chunked-prefill. (4) TP communication overhead -- if using tensor parallelism without NVLink, all-reduce may be the bottleneck.
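Steps (1) and (2) can be checked with a short scrape of the /metrics endpoint. A minimal sketch follows; note that exact metric names vary across vLLM versions (recent builds prefix them with "vllm:"), so the names below are assumptions to adjust for your build.

```python
# Sketch of checks (1) and (2): parse the Prometheus text from vLLM's
# /metrics endpoint and flag the two most common culprits. Metric names
# vary by vLLM version (recent builds prefix "vllm:"), so treat the
# exact names here as assumptions.
import re

def scrape_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        # e.g. 'vllm:gpu_cache_usage_perc 0.93' (labels, if present, ignored)
        m = re.match(r"([\w:]+)(?:\{[^}]*\})?\s+([0-9.eE+-]+)", line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

def diagnose(metrics: dict) -> list:
    findings = []
    if metrics.get("vllm:gpu_cache_usage_perc", 0.0) >= 0.99:
        findings.append("KV cache saturated: requests are queueing for memory")
    if metrics.get("vllm:num_preemptions_total", 0.0) > 0:
        findings.append("preemptions occurring: cached computation is being discarded")
    return findings
```

In practice you would feed `scrape_metrics` the body of an HTTP GET against `http://<host>:8000/metrics` (e.g. via `urllib.request.urlopen`) and alert on anything `diagnose` returns.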
How do I right-size the max-model-len parameter?
max-model-len directly determines the maximum KV cache size a single request can claim, which determines concurrent capacity. The formula: num_concurrent = available_kv_memory / (max_model_len * per_layer_kv_bytes_per_token * num_layers). If your workload rarely exceeds 4K tokens, setting this to 4096 instead of the model's 128K maximum frees enormous memory for additional concurrent requests. Always profile actual prompt + completion lengths first.
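The formula can be worked through concretely. The sketch below uses Llama-3.1-8B-like shapes (32 layers, 8 KV heads, head dim 128, fp16 cache); these numbers are illustrative, so read the real ones from your model's config.json.

```python
# Worked instance of the sizing formula for Llama-3.1-8B-like shapes
# (32 layers, 8 KV heads, head_dim 128, fp16 KV cache). Illustrative
# numbers; read the real values from your model's config.json.
def max_concurrent(available_kv_bytes: int, max_model_len: int,
                   num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    # Factor of 2: one K and one V vector per head, per layer, per token.
    per_token_per_layer = 2 * num_kv_heads * head_dim * dtype_bytes
    per_request = max_model_len * per_token_per_layer * num_layers
    return available_kv_bytes // per_request

# Assume ~40 GiB left for KV cache on an A100-80GB after fp16 weights.
slots = max_concurrent(40 * 1024**3, max_model_len=4096,
                       num_layers=32, num_kv_heads=8, head_dim=128)
print(slots)  # 80
```

This worst case (80 slots, every request pinned at the full 4096 tokens) sits well below the 200-400 real-world figure quoted later in this FAQ, because PagedAttention allocates blocks on demand and typical requests use far fewer tokens than the maximum.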
When should I use speculative decoding vs. scaling with more GPUs?
Speculative decoding reduces latency (time per output token) at some throughput cost. More GPUs reduce latency AND increase throughput. Use spec decode when: latency is primary, you have a well-matched draft model with >70% acceptance rates, and your GPU is underutilized during decode. Don't use it when: throughput is the priority, acceptance rate is low (<50%), or you're already compute-bound.
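A rough way to see why the acceptance rate matters so much: under the idealized i.i.d. acceptance model from the speculative decoding literature, with draft length gamma and per-token acceptance rate alpha, the expected number of tokens produced per target-model step is (1 - alpha^(gamma+1)) / (1 - alpha). The sketch below is illustrative only; real acceptance behavior is not i.i.d.

```python
# Expected tokens per target-model forward pass under an idealized
# i.i.d. acceptance model with draft length gamma and acceptance
# rate alpha. Illustrative only; real acceptance is not i.i.d.
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Why the >70% guidance: the payoff grows sharply with alpha.
for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, gamma=4):.2f} tokens/step")
```

At alpha=0.5 the expected gain is under 2x per step, which rarely covers the draft model's overhead; at alpha=0.9 it approaches 4x.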
What happens to the partial KV cache when a request is preempted?
In the V1 engine, preempted requests lose their entire partial KV cache. The blocks are freed and the request moves back to the waiting queue. When rescheduled, the KV cache is recomputed from scratch -- there is no swap-to-CPU mechanism. This means preemption is expensive. Monitor num_preemptions. If non-zero in steady state, you need more KV cache capacity (more GPU memory or lower max-model-len).
What's the practical limit on concurrent requests for my setup?
Depends on GPU memory, model size, and average sequence length. Rough examples: Llama-3.1-8B on an A100-80GB with max-model-len=4096 handles roughly 200-400 concurrent requests; a 70B model on 4xA100-80GB with TP=4 handles roughly 50-150. To find yours: gradually increase load while monitoring gpu_cache_usage_perc and num_requests_waiting. When cache hits 100% and waiting grows, you've found the ceiling.
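The ramp-up procedure can be sketched as a loop. The three probe functions below are placeholders you would wire to your load generator and a /metrics scrape; nothing here is a vLLM API.

```python
# Sketch of the ceiling-finding loop. set_concurrency, get_cache_usage,
# and get_num_waiting are placeholders for a load generator and a
# /metrics scrape; none of these are vLLM APIs.
def find_ceiling(set_concurrency, get_cache_usage, get_num_waiting,
                 start: int = 16, step: int = 16, limit: int = 1024) -> int:
    concurrency = start
    while concurrency <= limit:
        set_concurrency(concurrency)
        # Ceiling reached: cache pinned at ~100% AND the queue is growing.
        if get_cache_usage() >= 0.99 and get_num_waiting() > 0:
            return concurrency - step  # last level that still fit
        concurrency += step
    return limit
```

Hold each concurrency level long enough for the metrics to reach steady state before reading them; a single sample right after a ramp is noisy.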
Should I use FP8, AWQ, or GPTQ quantization?
FP8: Simplest option -- quantizes at serve time, works on Hopper GPUs (H100/H200), ~2x memory reduction with minimal quality loss. AWQ: Requires a pre-quantized model, works on all CUDA GPUs, 4-bit (~4x memory reduction), good quality. GPTQ: A similar 4-bit approach, though AWQ generally preserves quality better. Choose based on: hardware (FP8 for Hopper), available pre-quantized models (check HuggingFace), and quality requirements.
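The memory figures above are easy to verify with back-of-envelope math. The sketch below counts weights only; real quantized checkpoints carry some fp16 scales and embeddings, so actual files run slightly larger.

```python
# Back-of-envelope weight memory for an 8B-parameter model under each
# scheme. Weights only; KV cache and activations come on top, and real
# quantized checkpoints keep some tensors in fp16, so files run larger.
def weight_gb(num_params: float, bits_per_param: float) -> float:
    return num_params * bits_per_param / 8 / 1e9

params = 8e9
for name, bits in [("fp16 baseline", 16), ("FP8", 8), ("AWQ/GPTQ 4-bit", 4)]:
    print(f"{name}: {weight_gb(params, bits):.0f} GB")
```

The memory a quantization scheme frees goes straight to the KV cache, which is why a 4-bit 8B model can serve far more concurrent requests than its fp16 counterpart on the same GPU.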
How do I serve multiple models on the same GPU infrastructure?
Three approaches: (1) Separate instances per model, each bound to specific GPUs. Simple but wastes memory if underutilized. (2) LoRA adapters on a shared base with --enable-lora. Most memory-efficient when models share a base architecture. (3) Ray Serve routing in front of multiple vLLM instances, with automatic routing and autoscaling per model. Best for production multi-model deployments.
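Approach (2) can be sketched as follows; the base model name is real, but the adapter names and paths are placeholders for your own fine-tunes.

```shell
# Approach (2): one shared base model, multiple LoRA adapters.
# Adapter names and filesystem paths below are placeholders.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules support-bot=/adapters/support summarizer=/adapters/summarize

# Clients pick an adapter through the standard "model" field:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "support-bot", "prompt": "Hello", "max_tokens": 16}'
```

Each adapter adds only its low-rank weights on top of the shared base, which is why this is the most memory-efficient option when all models share an architecture.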
What happens when vLLM receives more requests than it can handle?
New requests enter the waiting queue. The scheduler promotes them to running as KV cache blocks free up. If the queue grows unbounded, clients experience increasing latency. There is no hard rejection -- requests wait indefinitely unless the client times out. In production, place a load balancer or rate limiter in front of vLLM to control admission.
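A minimal token-bucket admission check of the kind such a front-end would apply is sketched below; production deployments typically use nginx, Envoy, or a cloud rate limiter rather than hand-rolled code.

```python
# Minimal token-bucket admission control (a sketch, not production code):
# reject excess requests up front instead of letting vLLM's waiting
# queue, and thus client latency, grow without bound.
import time

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate = rate            # tokens replenished per second
        self.capacity = burst       # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True             # admit: forward to vLLM
        return False                # reject, e.g. with HTTP 429
```

Rejecting early with a 429 lets clients back off and retry, which degrades far more gracefully than unbounded queueing.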
💡 Monitoring is key: Most operational issues with vLLM come down to KV cache pressure. The single most important metric to watch is gpu_cache_usage_perc at the /metrics endpoint. If it's consistently at 100%, everything else will suffer.