
LLM Inference Sizing and Performance Guidance

When planning to deploy a chatbot or simple Retrieval-Augmented Generation (RAG) pipeline on VMware Private AI Foundation with NVIDIA [1], you may have questions about sizing (capacity) and performance based on your existing GPU resources or potential future GPU acquisitions. For instance:

  • What is the maximum number of concurrent requests that can be supported for a specific Large Language Model (LLM) on a specific GPU?
  • What is the maximum sequence length (or prompt size) that a user can send to the chat app without experiencing a noticeably slow response time?
  • What is the estimated response time (latency) for generating output tokens, and how does it vary with different input sizes and LLM sizes?

Conversely, if you have specific capacity or latency requirements for utilizing LLMs with X billion parameters, you may wonder:

  • What type and quantity of GPUs should you acquire to meet these requirements?

This blog post aims to help answer these questions and guide your inference deployment planning. Please note that the calculations presented below are simplified estimates of inference time, as they do not account for additional factors that can impact performance, such as GPU communication costs, network delays, and software stack overhead. In practice, these factors can increase inference time by up to 50% or more. Therefore, the numbers calculated below should be considered best-case estimates of the actual inference time.

For the remainder of this blog, we will use the following terminology:

  • “Prompt” refers to the input sequence sent to the LLM.
  • “Response” denotes the output sequence generated by the LLM.

Understand Your GPU’s Specifications

When selecting a GPU for your deployment, there are three key parameters to consider: FP16 Tensor Core throughput (we do not use the sparsity value in our estimates), GPU memory size, and GPU memory bandwidth. Table 1 lists recent GPUs with their relevant specifications. Since most LLMs use half-precision floating-point arithmetic (FP16), we focus on the FP16 Tensor Core capabilities of each GPU.

| GPU | FP16 (TFLOPS) | GPU Memory (GB) | Memory Bandwidth (GB/s) |
|---|---|---|---|
| A10 | 125 | 24 | 600 |
| A30 | 330 | 24 | 933 |
| L40 | 181 | 48 | 864 |
| L40s | 362 | 48 | 864 |
| A100 40 GB | 312 | 40 | 1555 |
| A100 40 GB SXM | 312 | 40 | 1555 |
| A100 80 GB PCIe | 312 | 80 | 1935 |
| A100 80 GB SXM | 312 | 80 | 2039 |
| H100 PCIe | 1513 | 80 | 2000 |
| H100 SXM | 1979 | 80 | 3350 |
| H100 NVL | 3958 | 188 | 7800 |

Table 1. GPU specifications

Important Consideration for Models Trained on Non-FP16 Data Formats

Some models are trained using alternative data formats like BF16 (Bfloat16), FP8, or INT8. While most modern GPUs, such as the L40, A100, and H100 series, offer the same TFLOPS performance for both FP16 and BF16, there are exceptions. For instance, the A30 GPU has different TFLOPS values for FP16 and BF16.

If you plan to use a model trained on one of these specific data formats, it’s crucial to replace the FP16 column in Table 1 with the corresponding TFLOPS value of your GPU, e.g., BF16, FP8, or INT8. Additionally, when working with quantized models using FP8 or INT8, be sure to use 1 Byte for memory-related calculations in the formulas below to ensure accurate results.

Understand Your LLM’s Specifications

Table 2 presents specifications for recent large language models (LLMs), with the first six columns provided by the model producers. The last column, KV cache size (GiB/token) for each model, is calculated based on the values in the preceding columns. Table 2 includes six models, with the Mixtral-8x7B-v0.1 model [5] being the only Sparse Mixture of Experts (SMoE) architecture. This model has a total of 46.7 billion parameters, but during inference, its internal router utilizes only 2 experts, resulting in an active parameter count of 12.7 billion.

| Model | Params (Billion) | Model dimension (d_model) | Attention heads (n_heads) | Attention layers (n_layers) | Max context window (N) | kv_cache_size (GiB/token) |
|---|---|---|---|---|---|---|
| Llama-3 8B | 8 | 4096 | 32 | 32 | 8192 | 0.00049 |
| Llama-3 70B | 70 | 8192 | 64 | 80 | 8192 | 0.00244 |
| Llama-3.1 8B | 8 | 4096 | 32 | 32 | 131072 | 0.00049 |
| Llama-3.1 70B | 70 | 8192 | 64 | 80 | 131072 | 0.00244 |
| Mistral-7B v0.3 | 7 | 4096 | 32 | 32 | 32768 | 0.00049 |
| Mixtral-8x7B-v0.1 | 47 (total) / 13 (active) | 4096 | 32 | 32 | 32768 | 0.00098 |

Table 2. LLM specifications

Calculating KV Cache Size per token for each model

At the core of every LLM lies the transformer engine, which consists of two distinct phases: prefill and autoregressive sampling.

  • In the prefill phase, the model processes the input prompt tokens in parallel, populating the key-value (KV) cache. The KV cache serves as the model’s state, embedded within the attention operation. During this phase, no tokens are generated.
  • In the autoregressive sampling phase, the model leverages the current state stored in the KV cache to sample and decode the next token. By reusing the KV cache, we avoid the computational overhead of recalculating the cache for every new token. This approach enables faster sampling, as we don’t need to pass all previously seen tokens through the model.

For a detailed explanation of the formula and its derivation, please refer to [2] and [3].
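The kv_cache_size values in Table 2 follow the standard per-token estimate of 2 (one key tensor and one value tensor) × 2 bytes (FP16) × n_layers × d_model. A minimal Python sketch of this calculation (the function name is illustrative):

```python
def kv_cache_size_per_token(n_layers: int, d_model: int, precision_bytes: int = 2) -> float:
    """Estimate the KV cache size (GiB) needed to store one token.

    The leading factor of 2 accounts for the key and value tensors;
    precision_bytes is 2 for FP16/BF16 models.
    """
    return 2 * precision_bytes * n_layers * d_model / 1024**3

# Llama-3 8B: n_layers = 32, d_model = 4096 -> ~0.00049 GiB/token, matching Table 2
print(kv_cache_size_per_token(n_layers=32, d_model=4096))
```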

Note: For the Mixtral-8x7B-v0.1 model, due to its sparse nature, the internal router uses 2 of the 8 experts during inference, so we additionally multiply by the number of active experts: 0.00049 GiB/token × 2 experts = 0.00098 GiB/token.

Calculate the Memory Footprint for a Specific Model

Now, using the kv_cache_size_per_token value, you can estimate the memory footprint required to support n_concurrent_request requests at a given average_context_window size. The formula to estimate the memory footprint is:
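memory_footprint ≈ model_weights + kv_cache_size ≈ (num_params × 2 bytes) + (kv_cache_size_per_token × avg_context_window × n_concurrent_request)

A minimal Python sketch of this estimate (the function name is illustrative; FP16 weights at 2 bytes per parameter are assumed):

```python
def memory_footprint_gb(params_billion: float, kv_gib_per_token: float,
                        avg_context_window: int, n_concurrent_request: int) -> float:
    """Estimated memory = FP16 model weights + KV cache for all concurrent requests."""
    model_weights = params_billion * 2  # ~2 GB per billion parameters in FP16
    kv_cache = kv_gib_per_token * avg_context_window * n_concurrent_request
    return model_weights + kv_cache

# Llama-3-8B serving 10 concurrent requests with a 4096-token average context window: ~36 GB
print(memory_footprint_gb(8, 0.00049, 4096, 10))
```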

Note: For the Mixtral-8x7B-v0.1 model, use the total model weights (47B) in the above formula.

With the estimated memory footprint, you can use this information to determine the type and number of GPUs required to meet your needs for VMware Private AI with NVIDIA.

Calculate Estimated Capacity  

All state-of-the-art LLMs are memory-bound [2, 4], meaning that their performance is limited by memory access bandwidth rather than by the GPU's computational capabilities, so batching multiple prompts into the KV cache is a common way to improve throughput.

Based on the above two steps, we can estimate the theoretical maximum number of tokens that can be processed by one or more GPUs. This is calculated using the following formula:
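max_tokens_in_kv_cache ≈ (num_gpu × gpu_memory_in_GB − model_weights_in_GB) / kv_cache_size_per_token

A minimal Python sketch (the function name is illustrative; FP16 weights are assumed, and small differences from Table 3 come from rounding kv_cache_size_per_token):

```python
def max_kv_cache_tokens(num_gpu: int, gpu_mem_gb: float,
                        params_billion: float, kv_gib_per_token: float) -> int:
    """Theoretical maximum number of tokens the KV cache can hold
    once the FP16 model weights are loaded across num_gpu GPUs."""
    free_mem = num_gpu * gpu_mem_gb - params_billion * 2  # weights: ~2 GB per billion params
    return int(free_mem / kv_gib_per_token)

# Llama-3-8B on a single 48 GB L40: ~65k tokens, in line with Table 3
print(max_kv_cache_tokens(1, 48, 8, 0.00049))
```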

| Model | Params (Billion) | 1× 48G L40/L40s | 1× 80G A100/H100 | 2× 48G L40/L40s | 2× 80G A100/H100 | 4× 48G L40/L40s | 4× 80G A100/H100 | 8× 48G L40/L40s | 8× 80G A100/H100 |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | 8 | 65536 | 131072 | 163840 | 294912 | 360448 | 622592 | 753664 | 1277952 |
| Llama-3-70B | 70 | OOM | OOM | OOM | 8192 | 21299 | 73728 | 99942 | 204800 |
| Llama-3.1-8B | 8 | 65536 | 131072 | 163840 | 294912 | 360448 | 622592 | 753664 | 1277952 |
| Llama-3.1-70B | 70 | OOM | OOM | OOM | 8192 | 21299 | 73728 | 99942 | 204800 |
| Mistral-7B-v0.3 | 7 | 69632 | 135168 | 167936 | 299008 | 364544 | 626688 | 757760 | 1282048 |
| Mixtral-8x7B-v0.1 | 47 | OOM | OOM | OOM | 67584 | 100352 | 231424 | 296960 | 559104 |

Table 3. Maximum number of tokens allowed in the KV cache for each model with different numbers and types of GPUs

In Table 3, “OOM” stands for Out of Memory, i.e., a particular GPU configuration does not have sufficient resources to handle a specific model. To overcome this limitation, we assume that a single model is distributed across multiple GPUs using parallelism, e.g., tensor parallelism. This allows us to handle larger models by splitting the model’s weights across multiple GPUs, reducing the memory requirement on each GPU. For example, to host the Llama-3-70B model, at least 70B params × 2 bytes = 140 GB is required for the model weights alone, so we need at minimum 7× A10/A30 24 GB GPUs, 4× L40/L40s 48 GB GPUs, 4× A100 40 GB GPUs, or 2× A100/H100 80 GB GPUs to hold it. Each GPU then holds 70 ÷ 7 = 10B, 70 ÷ 4 = 17.5B, or 70 ÷ 2 = 35B parameters, respectively.

Once you know the maximum number of tokens allowed in the KV cache, and assuming every request uses the longest acceptable prompt length for the model, you can estimate the maximum number of concurrent requests each LLM can handle. This represents the worst-case scenario, as it is unlikely that all concurrent users will use the maximum context length. However, this calculation provides a useful upper bound, helping you understand the limitations of your underlying system.
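As a sketch, the upper bound is simply the maximum number of KV cache tokens divided by the context window (the function name is illustrative):

```python
def max_concurrent_requests(max_kv_cache_tokens: int, context_window: int) -> int:
    """Worst-case concurrency: every request occupies the full context window."""
    return max_kv_cache_tokens // context_window

# Llama-3-8B on one 48 GB L40 with the full 8192-token window: 8 concurrent requests (Table 4)
print(max_concurrent_requests(65536, 8192))
```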

| Model | Max Context Window | 1× 48G L40/L40s | 1× 80G A100/H100 | 2× 48G L40/L40s | 2× 80G A100/H100 | 4× 48G L40/L40s | 4× 80G A100/H100 | 8× 48G L40/L40s | 8× 80G A100/H100 |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | 8192 | 8 | 16 | 20 | 36 | 44 | 76 | 92 | 156 |
| Llama-3-70B | 8192 | OOM | OOM | OOM | 1 | 3 | 9 | 12 | 25 |
| Llama-3.1-8B | 131072 | 1 | 1 | 1 | 2 | 3 | 5 | 6 | 10 |
| Llama-3.1-70B | 131072 | OOM | OOM | OOM | 0 | 0 | 1 | 1 | 2 |
| Mistral-7B-v0.3 | 32768 | 2 | 4 | 5 | 9 | 11 | 19 | 23 | 39 |
| Mixtral-8x7B-v0.1 | 32768 | OOM | OOM | OOM | 2 | 3 | 7 | 9 | 17 |

Table 4. Max concurrent requests when using the largest available context window for a prompt

In real-world use cases, average prompt sizes are often shorter, resulting in higher concurrent throughput. For instance, if the average context window is 4096 tokens, you can expect a significant increase in concurrent requests compared to using the maximum acceptable prompt length.

| Model | Context Window | 1× 48G L40/L40s | 1× 80G A100/H100 | 2× 48G L40/L40s | 2× 80G A100/H100 | 4× 48G L40/L40s | 4× 80G A100/H100 | 8× 48G L40/L40s | 8× 80G A100/H100 |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | 4096 | 16 | 32 | 40 | 72 | 88 | 152 | 184 | 312 |
| Llama-3-70B | 4096 | OOM | OOM | OOM | 2 | 5 | 18 | 24 | 50 |
| Llama-3.1-8B | 4096 | 16 | 32 | 40 | 72 | 88 | 152 | 184 | 312 |
| Llama-3.1-70B | 4096 | OOM | OOM | OOM | 2 | 5 | 18 | 24 | 50 |
| Mistral-7B-v0.3 | 4096 | 17 | 33 | 41 | 73 | 89 | 153 | 185 | 313 |
| Mixtral-8x7B-v0.1 | 4096 | OOM | OOM | OOM | 17 | 25 | 57 | 73 | 137 |

Table 5. Max concurrent requests when using an average context window of 4096 tokens

Calculate Load for Multiple Model Replicas

When calculating the load for multiple replicas of a model, use the smallest GPU configuration in the tables above that can hold the model, and apply that configuration's numbers to each replica.

For instance, consider the case where you want to deploy two replicas of the Llama-3-8B model on 2x L40 GPUs, with load balancing between them. In this case, each L40 GPU contains a whole-weight copy of the Llama-3-8B model, rather than distributing the weights among GPUs.

To calculate the total number of concurrent requests, you can multiply the number of replicas by the number of concurrent prompts supported by each replica. For example, assuming a prompt size of 4k, the calculation would be:

2 replicas × 16 concurrent prompts per replica = 32 concurrent requests

Calculate Estimated Latency

The total time to solution, which is the time it takes to receive a response, consists of two main components:

  • Prefill Time: The time it takes to process the input prompt, which includes the time to populate the key-value (KV) cache.
  • Token Generation Time: The time it takes to generate each new token in the response.

Note: The calculations presented below provide simplified estimates of inference time, which should be considered as lower bounds (or best estimates), as they do not account for various additional factors that can impact performance, such as GPU communication costs, network delays, and software stack overhead. In practice, these factors can increase inference time by up to 50% or more, depending on the specific system configuration and workload. 

Prefill_time_per_token on each GPU

The prefill section assumes that we batch all of the prompt tokens into a single forward pass. For simplicity, we assume the limiting bottleneck is compute, and not memory. Thus, we can use the following formula to estimate the prefill time (in milliseconds) required to process a single token in a prompt.
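The estimate assumes roughly two floating-point operations per model parameter per token, so, consistent with the values in Table 6:

prefill_time_per_token (ms) ≈ (2 × number_of_parameters_per_GPU) / FP16 throughput

A minimal Python sketch of this calculation (the function name is illustrative):

```python
def prefill_time_per_token_ms(params_billion: float, fp16_tflops: float) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token,
    divided by the GPU's FP16 Tensor Core throughput."""
    flops_per_token = 2 * params_billion * 1e9
    return flops_per_token / (fp16_tflops * 1e12) * 1000  # seconds -> milliseconds

# 8B model on an A10 (125 TFLOPS FP16): ~0.128 ms per prompt token (Table 6)
print(prefill_time_per_token_ms(8, 125))
```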

When using parallelism (e.g., tensor parallelism) to deploy a model, the weights are evenly distributed across multiple GPUs, so we use the portion of parameters held by each GPU for the estimate. In Table 6, for the 70B and 8x7B models, we show the minimum number of GPUs required to hold them. Specifically, for the 8x7B model, we use the 13 billion “active” parameters of the two selected experts for the estimate.

| GPU | 7B | 8B | 70B (min GPUs) | 8x7B (min GPUs) |
|---|---|---|---|---|
| A10 | 0.112 | 0.128 | 0.160 (7) | 0.208 (5) |
| A30 | 0.042 | 0.048 | 0.061 (7) | 0.079 (5) |
| L40 | 0.077 | 0.088 | 0.193 (4) | 0.144 (3) |
| L40s | 0.039 | 0.044 | 0.097 (4) | 0.072 (3) |
| A100 40 GB | 0.045 | 0.051 | 0.112 (4) | 0.083 (3) |
| A100 40 GB SXM | 0.045 | 0.051 | 0.112 (4) | 0.083 (3) |
| A100 80 GB PCIe | 0.045 | 0.051 | 0.224 (2) | 0.083 (2) |
| A100 80 GB SXM | 0.045 | 0.051 | 0.224 (2) | 0.083 (2) |
| H100 PCIe | 0.009 | 0.011 | 0.046 (2) | 0.017 (2) |
| H100 SXM | 0.007 | 0.008 | 0.035 (2) | 0.013 (2) |
| H100 NVL | 0.004 | 0.004 | 0.018 (2) | 0.007 (2) |

Table 6. Prefill time (ms) per prompt token for different model sizes and GPUs; for the 70B and 8x7B models, the minimum number of GPUs required is shown in parentheses

Generation_time_per_token on each GPU

The autoregressive part of generation is memory-bound [3]. We can estimate the time (in milliseconds) required to generate a single token in the response by the following formula. This metric is also known as Time Per Output Token (TPOT).
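Since every output token requires streaming the model weights (2 bytes per parameter in FP16) from GPU memory, and consistent with the values in Table 7:

generation_time_per_token (ms) ≈ (2 bytes × number_of_parameters_per_GPU) / memory bandwidth

A minimal Python sketch (the function name is illustrative):

```python
def generation_time_per_token_ms(params_billion: float, mem_bandwidth_gbs: float) -> float:
    """Memory-bound estimate: each output token reads the FP16 weights
    (2 bytes per parameter) from GPU memory."""
    bytes_read = 2 * params_billion * 1e9
    return bytes_read / (mem_bandwidth_gbs * 1e9) * 1000  # seconds -> milliseconds

# 8B model on an A10 (600 GB/s memory bandwidth): ~26.7 ms per output token (Table 7)
print(generation_time_per_token_ms(8, 600))
```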

Similarly in Table 7, we use the portion of parameters allocated to each GPU to estimate sizing and performance for 70B and 8x7B models.

| GPU | 7B | 8B | 70B (min GPUs) | 8x7B (min GPUs) |
|---|---|---|---|---|
| A10 | 23.3 | 26.7 | 33.3 (7) | 43.3 (5) |
| A30 | 15.0 | 17.1 | 21.4 (7) | 27.9 (5) |
| L40 | 16.2 | 18.5 | 40.5 (4) | 30.1 (3) |
| L40s | 16.2 | 18.5 | 40.5 (4) | 30.1 (3) |
| A100 40 GB | 9.0 | 10.3 | 22.5 (4) | 16.7 (3) |
| A100 40 GB SXM | 9.0 | 10.3 | 22.5 (4) | 16.7 (3) |
| A100 80 GB PCIe | 7.2 | 8.3 | 36.2 (2) | 13.4 (2) |
| A100 80 GB SXM | 6.9 | 7.8 | 34.3 (2) | 12.8 (2) |
| H100 PCIe | 7.0 | 8.0 | 35.0 (2) | 13.0 (2) |
| H100 SXM | 4.2 | 4.8 | 20.9 (2) | 7.8 (2) |
| H100 NVL | 1.8 | 2.1 | 9.0 (2) | 3.3 (2) |

Table 7. Generation_time_per_token (ms) per response token for different model sizes and GPUs; for the 70B and 8x7B models, the minimum number of GPUs required is shown in parentheses

These per-token estimates are useful for checking whether a configuration meets your requirements for time-to-first-token (TTFT), tokens per second (TPS), and inter-token latency (ITL).

To deliver a seamless user experience in chat-type applications, it’s essential to achieve a time-to-first-token (TTFT) below the average human visual reaction time of 200 milliseconds. This metric is also affected by other factors, such as network speed, prompt length, and model size. Keep these factors in mind, as they will influence the real values you observe on your infrastructure.

A fast TTFT is only half the equation; a high tokens-per-second (TPS) rate is equally important for real-time applications like chat. Typically, a TPS of 30 or higher is recommended, which corresponds to an ITL of around 33.3 milliseconds when using streaming output. To put this into perspective, consider the average human reading speed, estimated at 200-300 words per minute, with exceptional readers reaching up to 1,000 words per minute. In comparison, a model generating 30 tokens per second (roughly 22.5 words per second, assuming about 0.75 words per token) can produce around 1,350 words per minute. This is significantly faster than even the fastest human readers, making it more than capable of keeping pace with their reading speed.

Total latency of a request

We can calculate the estimated total time for a single prompt and its response by the following formula.
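total_latency ≈ prompt_size × prefill_time_per_token + response_size × generation_time_per_token

A minimal Python sketch (the function name is illustrative), consistent with Table 8:

```python
def total_latency_s(prompt_size: int, response_size: int,
                    prefill_ms_per_token: float, generation_ms_per_token: float) -> float:
    """Best-case request latency: prefill over the whole prompt
    plus autoregressive generation of the response."""
    total_ms = prompt_size * prefill_ms_per_token + response_size * generation_ms_per_token
    return total_ms / 1000  # milliseconds -> seconds

# 8B model on an A10: 4000-token prompt, 256-token response -> ~7.3 s (Table 8)
print(total_latency_s(4000, 256, 0.128, 26.7))
```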

| GPU | 7B | 8B | 70B (min GPUs) | 8x7B (min GPUs) |
|---|---|---|---|---|
| A10 | 6.4 | 7.3 | 9.2 (7) | 11.9 (5) |
| A30 | 4.0 | 4.6 | 5.7 (7) | 7.4 (5) |
| L40 | 4.5 | 5.1 | 11.1 (4) | 8.3 (3) |
| L40s | 4.3 | 4.9 | 10.8 (4) | 8.0 (3) |
| A100 40 GB | 2.5 | 2.8 | 6.2 (4) | 4.6 (3) |
| A100 40 GB SXM | 2.5 | 2.8 | 6.2 (4) | 4.6 (3) |
| A100 80 GB PCIe | 2.0 | 2.3 | 10.2 (2) | 3.8 (2) |
| A100 80 GB SXM | 1.9 | 2.2 | 9.7 (2) | 3.6 (2) |
| H100 PCIe | 1.8 | 2.1 | 9.1 (2) | 3.4 (2) |
| H100 SXM | 1.1 | 1.3 | 5.5 (2) | 2.0 (2) |
| H100 NVL | 0.5 | 0.5 | 2.4 (2) | 0.9 (2) |

Table 8. Estimated latency (s) for prompt_size = 4000 tokens and response_size = 256 tokens; for the 70B and 8x7B models, the minimum number of GPUs required is shown in parentheses

Use the Estimation Calculator Script

Now that you have a solid understanding of the key factors that impact LLM inference performance, you can leverage the provided Python script to estimate the memory footprint, capacity, and latency on VMware Private AI Foundation with NVIDIA.
To get started, you can customize the script's behavior by providing input values for the following parameters to reflect your desired configuration:

  • num_gpu (-g): Specify the number of GPUs you plan to use for your deployment.
  • prompt_sz (-p): Define the average size of the input prompts you expect to process.
  • response_sz (-r): Set the average size of the responses you expect to generate.
  • n_concurrent_req (-c): Indicate the number of concurrent requests you anticipate handling.
  • avg_context_window (-w): Specify the average context window size for your use case.

By modifying these input values or inserting your target models in the source code, you can easily estimate the performance characteristics of your LLM deployment and make informed decisions about your infrastructure requirements.
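For example, a run might look like this (the script file name here is a placeholder; use the actual file name from the download): python llm_sizing_calculator.py -g 2 -p 4096 -r 256 -c 10 -w 4096. This would estimate sizing and latency for two GPUs serving ten concurrent requests with 4096-token prompts, 256-token responses, and an average context window of 4096 tokens.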

Conclusion

In this blog post, we have provided a comprehensive guide to help you plan and deploy Large Language Models (LLMs) on VMware Private AI Foundation with NVIDIA. We have discussed the key factors that impact LLM inference performance, including GPU specifications and model specifications, and provided a set of formulas, tables, and a Python script to help you estimate the memory footprint, capacity, and latency of your LLM deployment based on your requirements. We hope this blog post helps guide you in delivering a seamless user experience in chat-type applications on VMware Private AI Foundation with NVIDIA.

Reference

[1] VMware Private AI Foundation with NVIDIA 

[2] A guide to LLM inference and performance

[3] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

[4] Efficient Memory Management for Large Language Model Serving with PagedAttention

[5] Mixtral 8x7B

Acknowledgments

The author thanks Ramesh Radhakrishnan, Frank Denneman, Rick Battle, Jessiely Consolacion, Shobhit Bhutani, and Roger Fortier from Broadcom’s VMware Cloud Foundation Division for reviewing and improving the paper.