
How Many Users Can Your LLM Server Really Handle?

Deploying large language models (LLMs) in an enterprise environment has transitioned from a proof-of-concept exercise to a rigorous engineering discipline. Yet, accurately predicting the capacity of an inference server under real-world, concurrent load remains a formidable challenge.

Infrastructure engineers frequently confront complex configuration spaces, questioning whether tuning parameters like --max-num-batched-tokens or --gpu-memory-utilization in vLLM will optimize throughput or inadvertently degrade tail latency. Official documentation provides the mechanisms for tuning, but it rarely offers a systematic method for discovering the optimal configuration for a specific workload, hardware architecture, and strict Service Level Agreement (SLA).
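To make those knobs concrete, here is a minimal, illustrative sketch using vLLM's offline Python API, where the same parameters that the serve CLI exposes as flags are passed as keyword arguments. The model identifier and every value shown are assumptions for illustration, not recommendations from our study.

```python
# Illustrative only: the tuning knobs discussed above, set through vLLM's
# Python API (the `vllm serve` CLI exposes the same parameters as flags).
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",     # model id assumed for illustration
    tensor_parallel_size=4,          # e.g. a 4x H100 node
    gpu_memory_utilization=0.90,     # fraction of VRAM vLLM may reserve
    max_num_batched_tokens=2048,     # per-step token budget (chunked prefill)
    enable_prefix_caching=True,      # deduplicate shared prompt prefixes
)

outputs = llm.generate(
    ["Explain KV-cache paging in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```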

To address this, we undertook a comprehensive capacity planning initiative for a 120-billion parameter Mixture-of-Experts (MoE) model (gpt-oss-120b), deployed across multiple NVIDIA H100 and H200 clusters to power an internal AI coding assistant. Rather than merely publishing our final capacity metrics, we have documented the rigorous, end-to-end methodology we developed to achieve them.

We have compiled our findings into a detailed technical white paper: SPOC: a Stateful, Profile-based Optimization for LLM Capacity Planning Methodology.

This white paper serves as a comprehensive guide to LLM performance engineering. It is designed to equip infrastructure teams with the analytical tools and empirical techniques required to:

  • Construct stateful, multi-turn datasets that accurately simulate the complex context accumulation of developers querying shared enterprise monorepos.
  • Apply multi-objective evolutionary algorithms (Optuna NSGA-II) to mathematically navigate the inference engine’s parameter space, replacing heuristic guesswork with rigorous optimization.
  • Deploy an advanced telemetry stack (Prometheus and DCGM Exporter) to correlate internal inference-engine metrics with physical hardware state.
  • Capture and interpret kernel-level NVIDIA Nsight Systems traces to identify the true architectural bottlenecks, which frequently defy the predictions of a simple theoretical roofline model.

If you are responsible for scaling LLM infrastructure, this paper provides the empirical blueprint required to transition from estimating capacity to systematically measuring and optimizing it.

The Problem with the “Just Run a Benchmark” Approach

Standard single-turn LLM benchmarks (MLPerf, GenAI Perf, InferenceMax) send a fixed prompt at a fixed concurrency and report average latency. That is fine for comparing models on a leaderboard. It is not fine for capacity planning against real-world use cases, such as coding tasks or log analysis, where users ask many follow-up questions; in these situations, multi-turn traffic simulation is a must.

Real traffic is messy. In our case, 70% of users send short requests (starting around 5,000 tokens and growing up to 50,000 tokens), 20% send medium-sized requests (starting at 15,000 and growing up to 120,000 tokens), and 10% submit entire code bases for deep analysis (starting around 75,000 tokens and pushing the 128,000-token context boundary). These three segments stress the inference engine in fundamentally different ways. The short requests dominate the request rate and set the floor for time-to-first-token (TTFT). The large requests dominate GPU memory bandwidth and prefill compute. A benchmark that treats them all as average-sized requests will provide a number that does not predict where the system will actually break.
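As a concrete illustration, the mix above can be expressed as a tiny sampler. The segment fractions and token figures are the ones quoted in this post; the distribution shape within each range, and the function and constant names, are our own illustrative assumptions.

```python
import random

# Segment fractions and token figures from the post; drawing the initial
# prompt size uniformly near the stated starting point is an assumption.
SEGMENTS = [
    # (share of users, initial prompt tokens (lo, hi), context growth cap)
    (0.70, (5_000, 10_000), 50_000),     # short, rate-dominating requests
    (0.20, (15_000, 30_000), 120_000),   # medium analysis sessions
    (0.10, (75_000, 100_000), 128_000),  # whole-codebase deep dives
]

def sample_profile(rng: random.Random) -> tuple[int, int]:
    """Pick a segment, then an initial prompt size and its growth ceiling."""
    r, cumulative = rng.random(), 0.0
    for share, (lo, hi), cap in SEGMENTS:
        cumulative += share
        if r < cumulative:
            return rng.randint(lo, hi), cap
    return rng.randint(*SEGMENTS[-1][1]), SEGMENTS[-1][2]

rng = random.Random(42)
print([sample_profile(rng) for _ in range(3)])
```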

We needed something better.

What We Built

The white paper describes a framework with three core stages:

  1. Workload modeling – We defined three user profiles (P0, P1, P2) calibrated from observed usage patterns, each with its own prompt size distribution, output budget, and think time. We built a stateful corpus from open-source trajectories (togethercomputer/CoderForge-Preview and nebius/SWE-rebench-openhands-trajectories), and used Locust to simulate multi-turn streaming conversations that behave like real developers interacting with a coding assistant, including a “Partial Common Ground” geometry to simulate shared enterprise monorepos (a minimal Locust sketch follows this list).
  2. Evolutionary parameter search – Instead of manually trying parameter combinations or running an exhaustive grid search, we used Optuna’s NSGA-II sampler to search the vLLM parameter space at our target concurrency. NSGA-II is a multi-objective evolutionary algorithm that simultaneously optimizes throughput, time-to-first-token, and inter-token latency. It finds the Pareto front: the set of configurations where you cannot improve one metric without sacrificing another (see the Optuna sketch below).
  3. Kernel-level profiling – This is where things got interesting. We captured NVIDIA Nsight Systems traces during steady-state load at our capacity ceilings (300 concurrent users on 4x H100, and 85 users on 2x H200). We decomposed the GPU active time into functional categories: Flash Attention, MoE Expert GEMMs, and NCCL collectives. The traces revealed that for this sparse MoE architecture at large batch sizes, the system becomes heavily bound by Attention compute and memory bandwidth, defying simple roofline predictions.
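For step 1, a heavily simplified Locust sketch of a stateful, multi-turn user against an OpenAI-compatible vLLM endpoint. The prompt content, think-time window, and model name are placeholders; the real harness samples turns from the trajectory corpus and streams responses so it can record TTFT and per-token latencies.

```python
# Hedged sketch: one stateful Locust user holding a growing conversation
# against an OpenAI-compatible endpoint. Prompts and timings are placeholders.
from locust import HttpUser, task, between

class CodingAssistantUser(HttpUser):
    wait_time = between(5, 30)  # "think time" between turns (assumed range)

    def on_start(self):
        # Context accumulates across turns; this is what makes the workload
        # stateful rather than a stream of independent prompts.
        self.messages = [
            {"role": "system", "content": "You are a coding assistant."}
        ]

    @task
    def next_turn(self):
        self.messages.append(
            {"role": "user", "content": "Refactor the cache layer in repo X."}
        )
        resp = self.client.post(
            "/v1/chat/completions",
            json={
                "model": "gpt-oss-120b",   # served model name, assumed
                "messages": self.messages,
                "max_tokens": 512,
            },
            name="chat_turn",
        )
        reply = resp.json()["choices"][0]["message"]["content"]
        # Feed the answer back in so the next turn carries more context.
        self.messages.append({"role": "assistant", "content": reply})
```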
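For step 2, a minimal sketch of the search loop: Optuna's NSGA-II sampler proposing configurations, with a placeholder run_load_test() standing in for "redeploy vLLM with this config, drive the workload at target concurrency, and return measured metrics". The parameter ranges and population size are illustrative.

```python
import optuna

def run_load_test(max_num_batched_tokens: int, gpu_memory_utilization: float):
    """Placeholder: redeploy vLLM with this config, run the Locust workload
    at target concurrency, and return (throughput, TTFT p95, ITL p95)."""
    raise NotImplementedError

def objective(trial: optuna.Trial):
    mnbt = trial.suggest_categorical(
        "max_num_batched_tokens", [1024, 2048, 4096, 8192]
    )
    gmu = trial.suggest_float("gpu_memory_utilization", 0.80, 0.95)
    throughput, ttft_p95, itl_p95 = run_load_test(mnbt, gmu)
    return throughput, ttft_p95, itl_p95

study = optuna.create_study(
    directions=["maximize", "minimize", "minimize"],  # tok/s, TTFT, ITL
    sampler=optuna.samplers.NSGAIISampler(population_size=16),
)
study.optimize(objective, n_trials=64)
print(study.best_trials)  # the Pareto front: non-dominated configurations
```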

From there, we swept the best configuration across concurrency levels, collecting Prometheus and DCGM Exporter hardware counters along the way.

What You Will Learn from the Paper

The paper is meant to be both a reference and a practical guide, and addresses the following topics:

  • How to design a workload simulation that reflects real user behavior and stateful context accumulation, not stateless synthetic averages
  • How to use multi-objective optimization to search the vLLM parameter space efficiently, and how much measured performance a few cycles of tuning these parameters can extract from the same GPUs
  • How to set up Prometheus and DCGM Exporter to gain simultaneous visibility into inference-engine internals and GPU hardware state (a query sketch follows this list)
  • How to capture and interpret NVIDIA Nsight Systems kernel traces from a containerized vLLM deployment under load (also sketched below)
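As an illustration of the telemetry pairing, the sketch below pulls one engine-side and two hardware-side series from Prometheus's HTTP API. The metric names are the ones conventionally exported by vLLM and dcgm-exporter, but verify them against your deployment; the Prometheus address is a placeholder.

```python
import requests

PROM = "http://prometheus:9090"  # placeholder address

def instant(query: str) -> list:
    """Run a PromQL instant query and return the result vector."""
    r = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    r.raise_for_status()
    return r.json()["data"]["result"]

# Engine internals next to physical hardware state, from the same store.
series = {
    "requests running": instant("vllm:num_requests_running"),
    "SM active (avg)": instant("avg(DCGM_FI_PROF_SM_ACTIVE)"),
    "GPU temp (max)": instant("max(DCGM_FI_DEV_GPU_TEMP)"),
}
for name, result in series.items():
    print(name, "=", result[0]["value"][1] if result else "n/a")
```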
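And for trace capture, a hedged sketch of wrapping the vLLM server launch in an Nsight Systems session, delayed past warm-up so the capture window covers steady-state load. The flags shown are standard nsys options; the paths, durations, and model id are illustrative, and in a container the nsys binary and output directory must be mounted in.

```python
import subprocess

# Hedged sketch: wrap the server launch in an nsys session so the capture
# window lands on steady-state load rather than startup and warm-up.
cmd = [
    "nsys", "profile",
    "--trace=cuda,nvtx,osrt",   # CUDA kernels, NVTX ranges, OS runtime
    "--delay=60",               # skip the first 60 s of warm-up
    "--duration=120",           # capture a two-minute steady-state window
    "--output=/traces/vllm_steady_state",
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "openai/gpt-oss-120b",
    "--tensor-parallel-size", "4",
]
subprocess.run(cmd, check=True)
```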

Beyond the methodology, here are some of the findings that deserve special attention:

  • Chunked Prefill is a vital trade-off. To protect the inter-token latency (ITL) of ongoing generations from massive prefill spikes caused by our 128k-token users, --max-num-batched-tokens must be carefully tuned. We found that setting it to 2048 (on 4x H100) or 1024 (on 2x H200) sacrifices some TTFT speed but maintains a smooth streaming experience and prevents catastrophic CUDA graph compilation timeouts.
  • GPU utilization is not an SLA metric. We measured ~37% SM Active at the capacity ceiling. You might think 60% of the GPU’s compute capacity is being left on the table. However, pushing utilization higher by filling scheduling gaps degrades the per-step decode latency (ITL) and causes the system to fail the SLA. The paper explains why chasing higher GPU utilization can actively degrade user experience.
  • VRAM is not always the bottleneck. Even with 10% of users submitting massive 80k-128k token contexts, active KV cache usage remained remarkably low (~10.5% on 4x H100). Because our dataset simulates a shared enterprise monorepo, vLLM’s prefix caching deduplicates the shared roots efficiently. The system was fundamentally compute-bound by Attention kernels and memory bandwidth, not VRAM capacity.
  • Hardware scaling is non-linear under tail-latency constraints. The 4x H100 system achieved ~3.5x the capacity of the 2x H200 system (300 vs 85 users), rather than the expected 2x. This is due to the compounding effects of aggregate memory bandwidth, the way Tensor Parallelism divides the math across more GPUs, and the chunked prefill penalty on smaller clusters.
  • Thermal vulnerabilities in Tensor Parallelism. Under TP > 1, the entire inference step proceeds only as fast as the slowest GPU. A single GPU experiencing thermal throttling will force all healthy GPUs to wait at NVLink synchronization barriers, causing severe, system-wide latency spikes.
  • Hardware Profiling Realities vs. Theoretical Models. The paper demonstrates how assumptions about quantization can mislead capacity planning. For instance, while gpt-oss-120b stores expert weights in MXFP4 (4-bit), vLLM on H100s unpacks them to BF16 in SM registers before matrix multiplication (W4A16). Assuming the model runs entirely in FP4 leads to mispredicting the bottleneck regime, a reality confirmed by our kernel profiling (a back-of-envelope sketch follows this list).
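To see why the storage-format versus compute-format distinction matters, here is a hedged back-of-envelope calculation. The peak-rate and bandwidth figures are approximate H100 SXM spec-sheet values, not measurements from the paper, and the model ignores MoE routing (which shrinks the effective per-expert batch) and activation traffic.

```python
# Hedged back-of-envelope: why W4A16 shifts the roofline. Spec figures are
# approximate H100 SXM values; MoE routing and activation traffic ignored.
H100_BF16_FLOPS = 989e12   # peak dense BF16 compute: the rate that applies,
                           # since MXFP4 weights are unpacked to BF16 (W4A16)
H100_HBM_BYTES = 3.35e12   # HBM3 bandwidth in bytes/s

BYTES_PER_PARAM = 0.5      # MXFP4 storage: what the memory traffic sees

def flops_per_byte(batch_tokens: int) -> float:
    # A weight matrix does ~2 FLOPs per parameter per token and is read
    # once per step, so intensity grows linearly with the batched tokens.
    return 2 * batch_tokens / BYTES_PER_PARAM

ridge = H100_BF16_FLOPS / H100_HBM_BYTES  # ~295 FLOPs/byte
for b in (16, 64, 128, 256):
    regime = "compute-bound" if flops_per_byte(b) > ridge else "bandwidth-bound"
    print(f"batch={b:>3}: {flops_per_byte(b):6.0f} FLOPs/byte -> {regime}")
```

Swap in a hypothetical FP4 compute peak and the ridge point roughly doubles, shifting where the model predicts the compute-bound regime begins; that is exactly the misprediction the bullet above warns about.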

Read the White Paper

We cannot claim to know the optimal number of users for your deployment; each deployment has a unique combination of model, hardware, workload mix, and latency targets, and each combination yields a different answer. The value of our research is the methodology detailed in our white paper: a repeatable process for finding your own answer with confidence.

The full paper is available here: SPOC: a Stateful, Profile-based Optimization for LLM Capacity Planning Methodology.

We would love to hear how it goes if you adapt the framework to your own setup. The best benchmarks are the ones that reflect your actual users.

