When planning to deploy a chatbot or a simple Retrieval-Augmented Generation (RAG) pipeline on VMware Private AI Foundation with NVIDIA (PAIF-N) [1], you may have questions about sizing (capacity) and performance based on your existing GPU resources or potential future GPU acquisitions. For instance:
- What is the maximum number of concurrent requests that can be supported for a specific Large Language Model (LLM) on a specific GPU?
- What is the maximum sequence length (or prompt size) that a user can send to the chat app without experiencing a noticeably slow response time?
- What is the estimated response time (latency) for generating output tokens, and how does it vary with different input sizes and LLM sizes?
Conversely, if you have specific capacity or latency requirements for utilizing LLMs with X billion parameters, you may wonder:
- What type and quantity of GPUs should you acquire to meet these requirements?
This blog post aims to help answer these questions and guide your inference deployment planning. Please note that the calculations presented below are simplified estimates of inference time, as they do not account for additional factors that can impact performance, such as GPU communication costs, network delays, and software stack overhead. In practice, these factors can increase inference time by up to 50% or more. Therefore, the numbers calculated below should be considered best-case estimates of the actual inference time.
For the remainder of this blog, we will use the following terminology:
- “Prompt” refers to the input sequence sent to the LLM.
- “Response” denotes the output sequence generated by the LLM.
Understand Your GPU’s Specifications
When selecting a GPU for your deployment, the key parameters to consider are FP16 Tensor Core throughput (we do not use the sparsity-accelerated values in our estimates), GPU memory size, and GPU memory bandwidth. Table 1 lists recent GPUs with their relevant specifications. Since most LLMs use half-precision floating-point arithmetic (FP16), we focus on the FP16 Tensor Core capabilities of each GPU.
GPU | FP16 (TFLOPS) | GPU Memory (GB) | Memory Bandwidth (GB/s) |
---|---|---|---|
A10 | 125 | 24 | 600 |
A30 | 330 | 24 | 933 |
L40 | 181 | 48 | 864 |
L40s | 362 | 48 | 864 |
A100 40 GB | 312 | 40 | 1555 |
A100 40 GB SXM | 312 | 40 | 1555 |
A100 80 GB PCIe | 312 | 80 | 1935 |
A100 80 GB SXM | 312 | 80 | 2039 |
H100 PCIe | 1513 | 80 | 2000 |
H100 SXM | 1979 | 80 | 3350 |
H100 NVL | 3958 | 188 | 7800 |
Some models are trained using the BF16 (bfloat16) data format or are quantized to FP8. Although most modern GPUs, such as the L40, A100, and H100 series, deliver the same TFLOPS for both FP16 and BF16, there are exceptions; for example, the A30 GPU has different TFLOPS values for FP16 and BF16. If you plan to use a model trained in a specific data format, replace the FP16 column with the corresponding TFLOPS value in your calculations.
Understand Your LLM’s Specifications
Table 2 presents specifications for recent large language models (LLMs), with the first six columns provided by the model producers. The last column, KV cache size (GiB/token) for each model, is calculated based on the values in the preceding columns. Table 2 includes six models, with the Mixtral-8x7B-v0.1 model [5] being the only Sparse Mixture of Experts (SMoE) architecture. This model has a total of 46.7 billion parameters, but during inference, its internal router utilizes only 2 experts, resulting in an active parameter count of 12.7 billion.
Model | Params (Billion) | Model dimension (d_model) | Attention heads (n_heads) | Transformer layers (n_layers) | Max context window (N) | kv_cache_size (GiB/token) |
---|---|---|---|---|---|---|
Llama-3 8B | 8 | 4096 | 32 | 32 | 8192 | 0.00049 |
Llama-3 70B | 70 | 8192 | 64 | 80 | 8192 | 0.00244 |
Llama-3.1 8B | 8 | 4096 | 32 | 32 | 131072 | 0.00049 |
Llama-3.1-70B | 70 | 8192 | 64 | 80 | 131072 | 0.00244 |
Mistral-7B v0.3 | 7 | 4096 | 32 | 32 | 32768 | 0.00049 |
Mixtral-8x7B-v0.1 | 47 (total), 13 (active) | 4096 | 32 | 32 | 32768 | 0.00098 |
Calculate the KV Cache Size per Token for Each Model
At the core of every LLM lies the transformer engine, which consists of two distinct phases: prefill and autoregressive sampling.
- In the prefill phase, the model processes the input prompt tokens in parallel, populating the key-value (KV) cache. The KV cache serves as the model’s state, embedded within the attention operation. During this phase, no tokens are generated.
- In the autoregressive sampling phase, the model leverages the current state stored in the KV cache to sample and decode the next token. By reusing the KV cache, we avoid the computational overhead of recalculating the cache for every new token. This approach enables faster sampling, as we don’t need to pass all previously seen tokens through the model.
kv_cache_size_per_token = (2 × 2 × n_layers × d_model) bytes/token
                        = (2 × 2 × 32 × 4096) bytes/token
                        = 524288 bytes/token ≈ 0.00049 GiB/token for Llama-3-8B
For a detailed explanation of the formula and its derivation, please refer to [2] and [3].
Note: For the Mixtral-8x7B-v0.1 model, due to its sparse nature, the internal router uses 2 of the 8 experts for inference, so we additionally multiply by the number of active experts: 0.00049 GiB/token × 2 experts = 0.00098 GiB/token.
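To make this concrete, here is a minimal Python sketch of the same arithmetic (the function and variable names are illustrative and are not taken from the calculator script discussed later):

```python
GIB = 1024 ** 3  # bytes per GiB


def kv_cache_size_per_token(n_layers: int, d_model: int,
                            bytes_per_value: int = 2, n_active_experts: int = 1) -> float:
    """KV cache per token in GiB: 2 (K and V) x bytes per value x n_layers x d_model."""
    return 2 * bytes_per_value * n_layers * d_model * n_active_experts / GIB


print(kv_cache_size_per_token(32, 4096))                      # Llama-3-8B   -> ~0.00049 GiB/token
print(kv_cache_size_per_token(80, 8192))                      # Llama-3-70B  -> ~0.00244 GiB/token
print(kv_cache_size_per_token(32, 4096, n_active_experts=2))  # Mixtral-8x7B -> ~0.00098 GiB/token
```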
Calculate the Memory Footprint for a Specific Model
Now, using the kv_cache_size_per_token value, you can estimate the memory footprint required to support a given number of concurrent requests (n_concurrent_request) with a given average context window (avg_context_window). The formula to estimate the memory footprint is:
GPU_memory_footprint = model_weights_size + kv_cache_size
                     = num_model_params × size_fp16 + kv_cache_size_per_token × avg_context_window × n_concurrent_request
                     = 8B params × 2 bytes + 0.00049 GiB/token × 1024 tokens × 10 requests
                     ≈ 21 GB for Llama-3-8B with an average context window of 1024 tokens and 10 concurrent requests
Note: For the Mixtral-8x7B-v0.1 model, use the total parameter count (47B) for the model weights term in the formula above.
With the estimated memory footprint, you can use this information to determine the type and number of GPUs required to meet your needs for VMware Private AI with NVIDIA.
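If you want to script this step, a small helper along these lines reproduces the 21 GB estimate above (a sketch with illustrative names; like the formula, it mixes GiB-based KV cache sizes with decimal GB for the weights, so treat the result as an approximation):

```python
def memory_footprint_gb(n_params_billion: float, kv_gib_per_token: float,
                        avg_context_window: int, n_concurrent_req: int,
                        bytes_per_param: int = 2) -> float:
    """Approximate GPU memory needed: FP16 weights plus KV cache for all concurrent requests."""
    weights_gb = n_params_billion * bytes_per_param              # e.g. 8B params x 2 bytes ~ 16 GB
    kv_cache_gb = kv_gib_per_token * avg_context_window * n_concurrent_req
    return weights_gb + kv_cache_gb


# Llama-3-8B, 1024-token average context window, 10 concurrent requests -> ~21 GB
print(round(memory_footprint_gb(8, 0.00049, 1024, 10)))
```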
Calculate Estimated Capacity
All state-of-the-art LLMs are memory-bound [2, 4], meaning that their performance is limited by memory access bandwidth rather than by the GPU's compute capability, so batching multiple prompts, whose state is held in the KV cache, is a common way to improve throughput.
Based on the above two steps, we can estimate the theoretical maximum number of tokens that can be processed by one or more GPUs. This is calculated using the following formula:
kv_cache_tokens = Remaining_GPU_Mem ÷ kv_cache_size_per_token
                = (Total_GPU_Mem − model_weights_size) ÷ kv_cache_size_per_token
                = (48 GB − 2 bytes × 8B params) ÷ 0.00049 GiB/token
                = 65536 tokens for Llama-3-8B on a 48 GB L40
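The same capacity estimate as a Python sketch (illustrative names; total_gpu_mem_gb is the combined memory of the GPUs serving the model):

```python
def kv_cache_tokens(total_gpu_mem_gb: float, n_params_billion: float,
                    kv_gib_per_token: float, bytes_per_param: int = 2) -> int:
    """Tokens that fit in the memory left over after loading the model weights."""
    weights_gb = n_params_billion * bytes_per_param
    remaining_gb = total_gpu_mem_gb - weights_gb
    if remaining_gb <= 0:
        return 0  # OOM: the weights alone do not fit
    return int(remaining_gb / kv_gib_per_token)


print(kv_cache_tokens(48, 8, 0.00049))  # Llama-3-8B on one 48 GB L40 -> ~65,000 tokens (Table 3 shows 65,536)
```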
Model | Params (Billion) | 1× 48G L40/L40s | 1× 80G A100/H100 | 2× 48G L40/L40s | 2× 80G A100/H100 | 4× 48G L40/L40s | 4× 80G A100/H100 | 8× 48G L40/L40s | 8× 80G A100/H100 |
---|---|---|---|---|---|---|---|---|---|
Llama-3-8B | 8 | 65536 | 131072 | 163840 | 294912 | 360448 | 622592 | 753664 | 1277952 |
Llama-3-70B | 70 | OOM | OOM | OOM | 8192 | 21299 | 73728 | 99942 | 204800 |
Llama-3.1-8B | 8 | 65536 | 131072 | 163840 | 294912 | 360448 | 622592 | 753664 | 1277952 |
Llama-3.1-70B | 70 | OOM | OOM | OOM | 8192 | 21299 | 73728 | 99942 | 204800 |
Mistral-7B-v0.3 | 7 | 69632 | 135168 | 167936 | 299008 | 364544 | 626688 | 757760 | 1282048 |
Mixtral-8x7B-v0.1 | 47 | OOM | OOM | OOM | 67584 | 100352 | 231424 | 296960 | 559104 |
In Table 3, “OOM” stands for Out of Memory, i.e., a particular GPU configuration does not have sufficient memory to handle a specific model. To overcome this limitation, we assume that a single model is distributed across multiple GPUs using parallelism, e.g., tensor parallelism. This allows us to handle larger models by splitting the model’s weights across multiple GPUs, reducing the memory requirement on each GPU. For example, hosting Llama-3-70B requires at least 70 billion parameters × 2 bytes/parameter = 140 GB for the model weights alone, so we need at minimum 7× A10/A30-24GB GPUs, 4× L40/L40s-48GB GPUs, 4× A100-40GB GPUs, or 2× A100/H100-80GB GPUs to hold it. Each GPU then holds 70 ÷ 7 = 10B, 70 ÷ 4 = 17.5B, or 70 ÷ 2 = 35B parameters, respectively.
Once you have the maximum number of tokens that fit in the KV cache, and assuming every request uses the model's maximum acceptable prompt length, you can estimate the maximum number of concurrent requests each LLM can handle. This represents the worst-case scenario, since it's unlikely that all concurrent users will use the maximum context length; however, it provides a useful upper bound and helps you understand the limits of your underlying system.
Concurrent_requests_with_the_longest_prompt_size = Max_tokens_in_KV_cache ÷ max_context_window
                                                 = 65536 tokens ÷ 8192 tokens per request
                                                 = 8 concurrent prompts for Llama-3-8B
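The worst-case concurrency then follows by dividing by the context window (a sketch, reusing the kv_cache_tokens figure above):

```python
def max_concurrent_requests(max_kv_cache_tokens: int, context_window: int) -> int:
    """Worst case: every request fills the model's full context window."""
    return max_kv_cache_tokens // context_window


print(max_concurrent_requests(65536, 8192))  # Llama-3-8B on one L40, max context   -> 8
print(max_concurrent_requests(65536, 4096))  # same GPU, 4096-token average prompt  -> 16
```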
Model | Max Context Window | 1× 48G L40/L40s | 1× 80G A100/H100 | 2× 48G L40/L40s | 2× 80G A100/H100 | 4× 48G L40/L40s | 4× 80G A100/H100 | 8× 48G L40/L40s | 8× 80G A100/H100 |
---|---|---|---|---|---|---|---|---|---|
Llama-3-8B | 8192 | 8 | 16 | 20 | 36 | 44 | 76 | 92 | 156 |
Llama-3-70B | 8192 | OOM | OOM | OOM | 1 | 3 | 9 | 12 | 25 |
Llama-3.1-8B | 131072 | 1 | 1 | 1 | 2 | 3 | 5 | 6 | 10 |
Llama-3.1-70B | 131072 | OOM | OOM | OOM | 0 | 0 | 1 | 1 | 2 |
Mistral-7B-v0.3 | 32768 | 2 | 4 | 5 | 9 | 11 | 19 | 23 | 39 |
Mixtral-8x7B-v0.1 | 32768 | OOM | OOM | OOM | 2 | 3 | 7 | 9 | 17 |
In real-world use cases, average prompt sizes are often shorter, resulting in higher concurrent throughput. For instance, if the average context window is 4096 tokens, you can expect a significant increase in concurrent requests compared to using the maximum acceptable prompt length.
Model | Context Window | 1× 48G L40/L40s | 1× 80G A100/H100 | 2× 48G L40/L40s | 2× 80G A100/H100 | 4× 48G L40/L40s | 4× 80G A100/H100 | 8× 48G L40/L40s | 8× 80G A100/H100 |
---|---|---|---|---|---|---|---|---|---|
Llama-3-8B | 4096 | 16 | 32 | 40 | 72 | 88 | 152 | 184 | 312 |
Llama-3-70B | 4096 | OOM | OOM | OOM | 2 | 5 | 18 | 24 | 50 |
Llama-3.1-8B | 4096 | 16 | 32 | 40 | 72 | 88 | 152 | 184 | 312 |
Llama-3.1-70B | 4096 | OOM | OOM | OOM | 2 | 5 | 18 | 24 | 50 |
Mistral-7B-v0.3 | 4096 | 17 | 33 | 41 | 73 | 89 | 153 | 185 | 313 |
Mixtral-8x7B-v0.1 | 4096 | OOM | OOM | OOM | 17 | 25 | 57 | 73 | 137 |
Calculate Load for Multiple Model Replicas
When calculating the load for multiple replicas of a model, start from the smallest GPU configuration in the tables above that can hold the model (the first non-OOM entry) and scale from there.
For instance, consider the case where you want to deploy two replicas of the Llama-3-8B model on 2x L40 GPUs, with load balancing between them. In this case, each L40 GPU contains a whole-weight copy of the Llama-3-8B model, rather than distributing the weights among GPUs.
To calculate the total number of concurrent requests, multiply the number of replicas by the number of concurrent prompts supported by each replica. For example, assuming an average context window of 4096 tokens (Table 5), the calculation would be:
2 replicas × 16 concurrent prompts per replica = 32 concurrent requests
Calculate Estimated Latency
The total time to solution, which is the time it takes to receive a response, consists of two main components:
- Prefill Time: The time it takes to process the input prompt, which includes the time to populate the key-value (KV) cache.
- Token Generation Time: The time it takes to generate each new token in the response.
Note: The calculations presented below provide simplified estimates of inference time, which should be considered as lower bounds (or best estimates), as they do not account for various additional factors that can impact performance, such as GPU communication costs, network delays, and software stack overhead. In practice, these factors can increase inference time by up to 50% or more, depending on the specific system configuration and workload.
Prefill_time_per_token on each GPU
The prefill section assumes that we batch all of the prompt tokens into a single forward pass. For simplicity, we assume the limiting bottleneck is compute, and not memory. Thus, we can use the following formula to estimate the prefill time (in milliseconds) required to process a single token in a prompt.
Prefill_time_per_token = (2 × number of LLM parameters per GPU) FLOPs ÷ accelerator compute bandwidth
                       = (2 × 8B) FLOPs ÷ 181 TFLOP/s
                       ≈ 0.088 ms for Llama-3-8B on an L40
When using parallelism to deploy a model (e.g., tensor parallelism), the weights are evenly distributed across multiple GPUs, so we use each GPU's share of the parameters in the estimate. In Table 6, for the 70B and 8x7B models, we show the minimum number of GPUs required to hold them. Specifically, for the 8x7B model, we use the 13 billion “active” parameters used by its two experts for the estimation.
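As a sketch, the prefill estimate (including the per-GPU parameter split used for tensor parallelism) looks like this in Python; the TFLOPS values are the FP16 figures from Table 1, and the names are illustrative:

```python
def prefill_time_per_token_ms(n_params_billion: float, gpu_tflops_fp16: float,
                              n_gpus: int = 1) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per token, weights split across GPUs."""
    flops_per_token = 2 * (n_params_billion / n_gpus) * 1e9
    return flops_per_token / (gpu_tflops_fp16 * 1e12) * 1e3


print(prefill_time_per_token_ms(8, 181))             # Llama-3-8B on one L40 -> ~0.088 ms
print(prefill_time_per_token_ms(70, 181, n_gpus=4))  # Llama-3-70B on 4x L40 -> ~0.193 ms
```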
GPU | 7B (ms) | 8B (ms) | 70B (ms) | Min. GPUs for 70B | 8x7B (ms) | Min. GPUs for 8x7B |
---|---|---|---|---|---|---|
A10 | 0.112 | 0.128 | 0.160 | 7 | 0.208 | 5 |
A30 | 0.042 | 0.048 | 0.061 | 7 | 0.079 | 5 |
L40 | 0.077 | 0.088 | 0.193 | 4 | 0.144 | 3 |
L40s | 0.039 | 0.044 | 0.097 | 4 | 0.072 | 3 |
A100 40 GB | 0.045 | 0.051 | 0.112 | 4 | 0.083 | 3 |
A100 40 GB SXM | 0.045 | 0.051 | 0.112 | 4 | 0.083 | 3 |
A100 80 GB PCIe | 0.045 | 0.051 | 0.224 | 2 | 0.083 | 2 |
A100 80 GB SXM | 0.045 | 0.051 | 0.224 | 2 | 0.083 | 2 |
H100 PCIe | 0.009 | 0.011 | 0.046 | 2 | 0.017 | 2 |
H100 SXM | 0.007 | 0.008 | 0.035 | 2 | 0.013 | 2 |
H100 NVL | 0.004 | 0.004 | 0.018 | 2 | 0.007 | 2 |
Generation_time_per_token on each GPU
The autoregressive part of generation is memory-bound [3]. We can estimate the time (in milliseconds) required to generate a single token in the response by the following formula. This metric is also known as Time Per Output Token (TPOT).
Generation_time_per_token = number of bytes moved (the model weights) per GPU ÷ accelerator memory bandwidth
                          = (2 × 8B) bytes ÷ 864 GB/s
                          ≈ 18.5 ms for Llama-3-8B on an L40
Similarly in Table 7, we use the portion of parameters allocated to each GPU to estimate sizing and performance for 70B and 8x7B models.
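The memory-bound counterpart can be sketched the same way (illustrative names; memory bandwidth in GB/s from Table 1, weights again split across GPUs):

```python
def generation_time_per_token_ms(n_params_billion: float, gpu_mem_bw_gbs: float,
                                 n_gpus: int = 1) -> float:
    """Memory-bound estimate: each generated token reads the per-GPU share of the FP16 weights once."""
    bytes_moved = 2 * (n_params_billion / n_gpus) * 1e9
    return bytes_moved / (gpu_mem_bw_gbs * 1e9) * 1e3


print(generation_time_per_token_ms(8, 864))             # Llama-3-8B on one L40 -> ~18.5 ms
print(generation_time_per_token_ms(70, 864, n_gpus=4))  # Llama-3-70B on 4x L40 -> ~40.5 ms
```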
GPU | 7B (ms) | 8B (ms) | 70B (ms) | Min. GPUs for 70B | 8x7B (ms) | Min. GPUs for 8x7B |
---|---|---|---|---|---|---|
A10 | 23.3 | 26.7 | 33.3 | 7 | 43.3 | 5 |
A30 | 15.0 | 17.1 | 21.4 | 7 | 27.9 | 5 |
L40 | 16.2 | 18.5 | 40.5 | 4 | 30.1 | 3 |
L40s | 16.2 | 18.5 | 40.5 | 4 | 30.1 | 3 |
A100 40 GB | 9.0 | 10.3 | 22.5 | 4 | 16.7 | 3 |
A100 40 GB SXM | 9.0 | 10.3 | 22.5 | 4 | 16.7 | 3 |
A100 80 GB PCIe | 7.2 | 8.3 | 36.2 | 2 | 13.4 | 2 |
A100 80 GB SXM | 6.9 | 7.8 | 34.3 | 2 | 12.8 | 2 |
H100 PCIe | 7.0 | 8.0 | 35.0 | 2 | 13.0 | 2 |
H100 SXM | 4.2 | 4.8 | 20.9 | 2 | 7.8 | 2 |
H100 NVL | 1.8 | 2.1 | 9.0 | 2 | 3.3 | 2 |
These per-token estimates are useful for checking whether a deployment meets your requirements for time-to-first-token (TTFT), tokens-per-second (TPS), and inter-token latency (ITL).
To deliver a seamless user experience in chat-type applications, it’s essential to achieve a time-to-first-token (TTFT) below the average human visual reaction time of about 200 milliseconds. TTFT is also affected by other factors, such as network speed, prompt length, and model size, so expect your measured values on your infrastructure to differ from these estimates.
A fast TTFT is only half the equation; a high tokens-per-second (TPS) rate is equally important for real-time applications like chat. Typically, a TPS of 30 or higher is recommended, which corresponds to an ITL of around 33.3 milliseconds when using streaming output. To put this into perspective, consider average human reading speed, estimated at 200–300 words per minute, with exceptional readers reaching up to 1,000 words per minute. In comparison, a model generating 30 tokens per second (roughly 22 words per second, assuming about 0.75 words per token) produces around 1,350 words per minute, comfortably faster than even the fastest human readers.
Total latency of a request
We can calculate the estimated total time for a single prompt and its response with the following formula; Table 8 lists this estimate, in seconds, for a roughly 4K-token prompt and a 256-token response.
Estimated_response_time = prefill_time + generation_time
                        = prompt_size × prefill_time_per_token + response_size × generation_time_per_token
                        = 4000 tokens × 0.088 ms + 256 tokens × 18.5 ms
                        ≈ 5.1 s for Llama-3-8B on an L40
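Combining the two per-token estimates gives the end-to-end figure (a sketch; the per-token values are the L40 numbers derived above):

```python
def response_time_s(prompt_size: int, response_size: int,
                    prefill_ms_per_token: float, generation_ms_per_token: float) -> float:
    """Total latency = prefill of the whole prompt + autoregressive generation of the response."""
    return (prompt_size * prefill_ms_per_token + response_size * generation_ms_per_token) / 1e3


# Llama-3-8B on one L40, 4000-token prompt, 256-token response -> ~5.1 s
print(round(response_time_s(4000, 256, 0.088, 18.5), 1))
```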
GPU | 7B (s) | 8B (s) | 70B (s) | Min. GPUs for 70B | 8x7B (s) | Min. GPUs for 8x7B |
---|---|---|---|---|---|---|
A10 | 6.4 | 7.3 | 9.2 | 7 | 11.9 | 5 |
A30 | 4.0 | 4.6 | 5.7 | 7 | 7.4 | 5 |
L40 | 4.5 | 5.1 | 11.1 | 4 | 8.3 | 3 |
L40s | 4.3 | 4.9 | 10.8 | 4 | 8.0 | 3 |
A100 40 GB | 2.5 | 2.8 | 6.2 | 4 | 4.6 | 3 |
A100 40 GB SXM | 2.5 | 2.8 | 6.2 | 4 | 4.6 | 3 |
A100 80 GB PCIe | 2.0 | 2.3 | 10.2 | 2 | 3.8 | 2 |
A100 80 GB SXM | 1.9 | 2.2 | 9.7 | 2 | 3.6 | 2 |
H100 PCIe | 1.8 | 2.1 | 9.1 | 2 | 3.4 | 2 |
H100 SXM | 1.1 | 1.3 | 5.5 | 2 | 2.0 | 2 |
H100 NVL | 0.5 | 0.5 | 2.4 | 2 | 0.9 | 2 |
Use the Estimation Calculator Script
Now that you have a solid understanding of the key factors that impact LLM inference performance, you can leverage the provided Python script to estimate the memory footprint, capacity, and latency on VMware Private AI Foundation with NVIDIA.
To get started, simply update the input variables in the script to reflect your desired configuration:
- num_gpu: Specify the number of GPUs you plan to use for your deployment.
- prompt_sz: Define the average size of the input prompts you expect to process.
- response_sz: Set the average size of the responses you expect to generate.
- n_concurrent_req: Indicate the number of concurrent requests you anticipate handling.
- avg_context_window: Specify the average context window size for your use case.
By modifying these variables, you can easily estimate the performance characteristics of your LLM deployment and make informed decisions about your infrastructure requirements.
✗ python LLM_size_pef_calculator.py -g 1 -p 4096 -r 256 -c 10 -w 1024
num_gpu = 1, prompt_size = 4096 tokens, response_size = 256 tokens
n_concurrent_request = 10, avg_context_window = 1024 tokens

******************** Estimate LLM Memory Footprint ********************
| Model           | KV Cache Size per Token   | Memory Footprint   |
|-----------------+---------------------------+--------------------|
| Llama-3-8B      | 0.000488 GiB/token        | 21.00 GB           |
| Llama-3-70B     | 0.002441 GiB/token        | 165.00 GB          |
| Llama-3.1-8B    | 0.000488 GiB/token        | 21.00 GB           |
| Llama-3.1-70B   | 0.002441 GiB/token        | 165.00 GB          |
| Mistral-7B-v0.3 | 0.000488 GiB/token        | 19.00 GB           |

******************** Estimate LLM Capacity and Latency ********************
| Model           | GPU             | KV Cache Tokens   | Prefill Time   | Generation Time   | Estimated Response Time   |
|-----------------+-----------------+-------------------+----------------+-------------------+---------------------------|
| Llama-3-8B      | A10             | 16384.0           | 0.128 ms       | 26.667 ms         | 7.4 s                     |
| Llama-3-8B      | A30             | 16384.0           | 0.048 ms       | 17.149 ms         | 4.6 s                     |
| Llama-3-8B      | L40             | 65536.0           | 0.088 ms       | 18.519 ms         | 5.1 s                     |
| Llama-3-8B      | L40s            | 65536.0           | 0.044 ms       | 18.519 ms         | 4.9 s                     |
| Llama-3-8B      | A100 40 GB      | 49152.0           | 0.051 ms       | 10.289 ms         | 2.8 s                     |
| Llama-3-8B      | A100 40 GB SXM  | 49152.0           | 0.051 ms       | 10.289 ms         | 2.8 s                     |
| Llama-3-8B      | A100 80 GB PCIe | 131072.0          | 0.051 ms       | 8.269 ms          | 2.3 s                     |
| Llama-3-8B      | A100 80 GB SXM  | 131072.0          | 0.051 ms       | 7.847 ms          | 2.2 s                     |
| Llama-3-8B      | H100 PCIe       | 131072.0          | 0.011 ms       | 8.000 ms          | 2.1 s                     |
| Llama-3-8B      | H100 SXM        | 131072.0          | 0.008 ms       | 4.776 ms          | 1.3 s                     |
| Llama-3-8B      | H100 NVL        | 352256.0          | 0.004 ms       | 2.051 ms          | 0.5 s                     |
| Llama-3-70B     | A10             | OOM               | 1.120 ms       | 233.333 ms        | 64.3 s                    |
| Llama-3-70B     | A30             | OOM               | 0.424 ms       | 150.054 ms        | 40.2 s                    |
| Llama-3-70B     | L40             | OOM               | 0.773 ms       | 162.037 ms        | 44.6 s                    |
| Llama-3-70B     | L40s            | OOM               | 0.387 ms       | 162.037 ms        | 43.1 s                    |
| Llama-3-70B     | A100 40 GB      | OOM               | 0.449 ms       | 90.032 ms         | 24.9 s                    |
| Llama-3-70B     | A100 40 GB SXM  | OOM               | 0.449 ms       | 90.032 ms         | 24.9 s                    |
| Llama-3-70B     | A100 80 GB PCIe | OOM               | 0.449 ms       | 72.351 ms         | 20.4 s                    |
| Llama-3-70B     | A100 80 GB SXM  | OOM               | 0.449 ms       | 68.661 ms         | 19.4 s                    |
| Llama-3-70B     | H100 PCIe       | OOM               | 0.093 ms       | 70.000 ms         | 18.3 s                    |
| Llama-3-70B     | H100 SXM        | OOM               | 0.071 ms       | 41.791 ms         | 11.0 s                    |
| Llama-3-70B     | H100 NVL        | 19660.8           | 0.035 ms       | 17.949 ms         | 4.7 s                     |
Conclusion
In this blog post, we have provided a comprehensive guide to help you plan and deploy Large Language Models (LLMs) on VMware Private AI Foundation with NVIDIA. We discussed the key factors that impact LLM inference performance, including GPU and model specifications, and provided a set of formulas, tables, and a Python script to help you estimate the memory footprint, capacity, and latency of your LLM deployment based on your requirements. We hope this blog post helps you deliver a seamless user experience in chat-type applications on VMware Private AI Foundation with NVIDIA.
References
[1] VMware Private AI Foundation with NVIDIA
[2] A guide to LLM inference and performance
[3] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
[4] Efficient Memory Management for Large Language Model Serving with PagedAttention
[5] Mixtral 8x7B
Acknowledgments
The author thanks Ramesh Radhakrishnan, Frank Dennemen, Rick Battle, Jessiely Consolacion, Shobhit Bhutani, and Roger Fortier from Broadcom’s VMware Cloud Foundation Division for reviewing and improving the paper.