
LLM Inference Sizing and Performance Guidance

When planning to deploy a chatbot or simple Retrieval-Augmented Generation (RAG) pipeline on VMware Private AI Foundation with NVIDIA (PAIF-N) [1], you may have questions about sizing (capacity) and performance based on your existing GPU resources or potential future GPU acquisitions. For instance:

  • What is the maximum number of concurrent requests that can be supported for a specific Large Language Model (LLM) on a specific GPU?
  • What is the maximum sequence length (or prompt size) that a user can send to the chat app without experiencing a noticeably slow response time?
  • What is the estimated response time (latency) for generating output tokens, and how does it vary with different input sizes and LLM sizes?

Conversely, if you have specific capacity or latency requirements for utilizing LLMs with X billion parameters, you may wonder:

  • What type and quantity of GPUs should you acquire to meet these requirements?

This blog post aims to help answer these questions and guide your inference deployment planning. Please note that the calculations presented below are simplified estimates of inference time, as they do not account for additional factors that can impact performance, such as GPU communication costs, network delays, and software stack overhead. In practice, these factors can increase inference time by up to 50% or more. Therefore, the numbers calculated below should be considered best-case estimates of the actual inference time.

For the remainder of this blog, we will use the following terminology:

  • “Prompt” refers to the input sequence sent to the LLM.
  • “Response” denotes the output sequence generated by the LLM.

Understand Your GPU’s Specifications

When selecting a GPU for your deployment, the key parameters to consider are FP16 Tensor Core throughput (we do not use the sparsity-accelerated figures in our estimates), GPU memory size, and GPU memory bandwidth. Table 1 lists recent GPUs with their relevant specifications. Since most LLMs use half-precision floating-point arithmetic (FP16), we focus on the FP16 Tensor Core capabilities of each GPU.

| GPU | FP16 (TFLOPS) | GPU Memory (GB) | Memory Bandwidth (GB/s) |
|-----|---------------|-----------------|-------------------------|
| A10 | 125 | 24 | 600 |
| A30 | 330 | 24 | 933 |
| L40 | 181 | 48 | 864 |
| L40s | 362 | 48 | 864 |
| A100 40 GB | 312 | 40 | 1555 |
| A100 40 GB SXM | 312 | 40 | 1555 |
| A100 80 GB PCIe | 312 | 80 | 1935 |
| A100 80 GB SXM | 312 | 80 | 2039 |
| H100 PCIe | 1513 | 80 | 2000 |
| H100 SXM | 1979 | 80 | 3350 |
| H100 NVL | 3958 | 188 | 7800 |

Table 1. GPU specifications

Some models are trained in the BF16 (bfloat16) data format, or are served in FP8 through quantization. Although most modern GPUs, such as the L40, A100, and H100 series, deliver the same TFLOPS for FP16 and BF16, there are exceptions; the A30, for example, has different TFLOPS values for FP16 and BF16. If you plan to use a model trained or quantized for a specific data format, replace the FP16 column with the corresponding TFLOPS value in your calculations.

Understand Your LLM’s Specifications

Table 2 presents specifications for recent large language models (LLMs), with the first six columns provided by the model producers. The last column, KV cache size (GiB/token) for each model, is calculated based on the values in the preceding columns. Table 2 includes six models, with the Mixtral-8x7B-v0.1 model [5] being the only Sparse Mixture of Experts (SMoE) architecture. This model has a total of 46.7 billion parameters, but during inference, its internal router utilizes only 2 experts, resulting in an active parameter count of 12.7 billion.

| Model | Params (Billion) | Model dimension (d_model) | Attention heads (n_heads) | Attention layers (n_layers) | Max context window (N) | kv_cache_size (GiB/token) |
|-------|------------------|---------------------------|---------------------------|-----------------------------|------------------------|---------------------------|
| Llama-3 8B | 8 | 4096 | 32 | 32 | 8192 | 0.00049 |
| Llama-3 70B | 70 | 8192 | 64 | 80 | 8192 | 0.00244 |
| Llama-3.1 8B | 8 | 4096 | 32 | 32 | 131072 | 0.00049 |
| Llama-3.1 70B | 70 | 8192 | 64 | 80 | 131072 | 0.00244 |
| Mistral-7B v0.3 | 7 | 4096 | 32 | 32 | 32768 | 0.00049 |
| Mixtral-8x7B-v0.1 | 47 (total) / 13 (active) | 4096 | 32 | 32 | 32768 | 0.00098 |

Table 2. LLM specifications

Calculating KV Cache Size per token for each model

At the core of every LLM lies the transformer engine, which consists of two distinct phases: prefill and autoregressive sampling.

  • In the prefill phase, the model processes the input prompt tokens in parallel, populating the key-value (KV) cache. The KV cache serves as the model’s state, embedded within the attention operation. During this phase, no tokens are generated.
  • In the autoregressive sampling phase, the model leverages the current state stored in the KV cache to sample and decode the next token. By reusing the KV cache, we avoid the computational overhead of recalculating the cache for every new token. This approach enables faster sampling, as we don’t need to pass all previously seen tokens through the model.
kv_cache_size_per_token
    = (2 × 2 × n_layers × d_model) bytes/token   (2 for the key and value tensors, 2 bytes per FP16 value)
    = (2 × 2 × 32 × 4096) bytes/token
    = 524288 bytes/token
    ≈ 0.00049 GiB/token for Llama-3-8B

For a detailed explanation of the formula and its derivation, please refer to [2] and [3].

Note: For the Mixtral-8x7B-v0.1 model, due to its sparse nature, the internal router uses 2 of the 8 expert models for inference, so we additionally multiply by the number of active experts: 0.00049 GiB/token × 2 experts = 0.00098 GiB/token.
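
As a quick sanity check, this calculation can be scripted. The minimal sketch below (the function and variable names are our own, not taken from the calculator script introduced later) reproduces the per-token KV cache sizes in Table 2:

```python
# Minimal sketch of the KV cache size formula above:
# 2 (key and value) x 2 bytes (FP16) x n_layers x d_model,
# multiplied by the number of active experts for an SMoE model.

def kv_cache_size_per_token(n_layers: int, d_model: int, active_experts: int = 1) -> float:
    """KV cache size in GiB per token."""
    bytes_per_token = 2 * 2 * n_layers * d_model * active_experts
    return bytes_per_token / 1024**3  # bytes -> GiB

print(kv_cache_size_per_token(32, 4096))                    # Llama-3-8B:   ~0.00049 GiB/token
print(kv_cache_size_per_token(80, 8192))                    # Llama-3-70B:  ~0.00244 GiB/token
print(kv_cache_size_per_token(32, 4096, active_experts=2))  # Mixtral-8x7B: ~0.00098 GiB/token
```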

Calculate the Memory Footprint for a Specific Model

Using the kv_cache_size_per_token value, you can estimate the memory footprint required to support a given number of concurrent requests (n_concurrent_request) at a given average context window (avg_context_window). The formula is:

GPU_memory_footprint
    = model_weights_size + kv_cache_size
    = num_model_params × size_fp16 + kv_cache_size_per_token × avg_context_window × n_concurrent_request
    = 8B params × 2 Bytes + 0.00049 GiB/token × 1024 tokens × 10 requests
    = 21 GB for Llama-3-8B with an average context window of 1024 tokens and 10 concurrent requests

Note: For the Mixtral-8x7b-v0.1 model, we need to use the model total weights (47B) in the above formula.

With the estimated memory footprint, you can use this information to determine the type and number of GPUs required to meet your needs for VMware Private AI with NVIDIA.
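
If you prefer to script this step as well, the following minimal sketch mirrors the formula above (like the worked example, it mixes decimal GB for the weights with GiB-based KV cache sizes, so treat the result as a rough planning number):

```python
# Minimal sketch of the GPU memory footprint estimate above.

def memory_footprint_gb(n_params_billion: float,
                        kv_gib_per_token: float,
                        avg_context_window: int,
                        n_concurrent_req: int,
                        bytes_per_param: int = 2) -> float:
    weights_gb = n_params_billion * 1e9 * bytes_per_param / 1e9  # FP16 weights
    kv_cache_gb = kv_gib_per_token * avg_context_window * n_concurrent_req
    return weights_gb + kv_cache_gb

# Llama-3-8B, 1024-token average context window, 10 concurrent requests -> ~21 GB
print(memory_footprint_gb(8, 0.00049, 1024, 10))
```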

Calculate Estimated Capacity  

All state-of-the-art LLMs are memory-bound [2, 4], meaning that their performance is limited by memory access bandwidth rather than by the computational capabilities of the GPU. Batching the KV caches of multiple prompts on the same GPU is therefore a common way to improve throughput.

Based on the above two steps, we can estimate the theoretical maximum number of tokens that can be processed by one or more GPUs. This is calculated using the following formula:

kv_cache_tokens
    = Remaining_GPU_Mem ÷ kv_cache_size_per_token
    = (Total_GPU_Mem - model_weights_size) ÷ kv_cache_size_per_token
    = (48 GB - 2 Bytes × 8B params) ÷ 0.00049 GiB/token
    = 65536 tokens for Llama-3-8B on a single 48 GB L40
| Model | Params (Billion) | 1× 48G L40/L40s | 1× 80G A100/H100 | 2× 48G L40/L40s | 2× 80G A100/H100 | 4× 48G L40/L40s | 4× 80G A100/H100 | 8× 48G L40/L40s | 8× 80G A100/H100 |
|-------|------------------|-----------------|------------------|-----------------|------------------|-----------------|------------------|-----------------|------------------|
| Llama-3-8B | 8 | 65536 | 131072 | 163840 | 294912 | 360448 | 622592 | 753664 | 1277952 |
| Llama-3-70B | 70 | OOM | OOM | OOM | 8192 | 21299 | 73728 | 99942 | 204800 |
| Llama-3.1-8B | 8 | 65536 | 131072 | 163840 | 294912 | 360448 | 622592 | 753664 | 1277952 |
| Llama-3.1-70B | 70 | OOM | OOM | OOM | 8192 | 21299 | 73728 | 99942 | 204800 |
| Mistral-7B-v0.3 | 7 | 69632 | 135168 | 167936 | 299008 | 364544 | 626688 | 757760 | 1282048 |
| Mixtral-8x7B-v0.1 | 47 | OOM | OOM | OOM | 67584 | 100352 | 231424 | 296960 | 559104 |

Table 3. Max number of tokens allowed in the KV cache for a model with different numbers and types of GPUs

In Table 3, “OOM” stands for Out of Memory, i.e., a particular GPU configuration does not have sufficient memory to hold a specific model. To overcome this limitation, we assume that a single model is distributed across multiple GPUs using parallelism, e.g., tensor parallelism. This allows us to handle larger models by splitting the model’s weights across multiple GPUs, reducing the memory requirement on each GPU. For example, hosting Llama-3-70B requires at least 70 billion × 2 Bytes/parameter = 140 GB for the model weights alone, so we need at minimum 7× 24 GB A10/A30 GPUs, 4× 48 GB L40/L40s GPUs, 4× 40 GB A100 GPUs, or 2× 80 GB A100/H100 GPUs to hold it. In those cases, each GPU holds 70 ÷ 7 = 10B, 70 ÷ 4 = 17.5B, or 70 ÷ 2 = 35B parameters, respectively.
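
The capacity estimate behind Table 3 can be sketched the same way. The snippet below is a simplified illustration under the assumptions above (weights split evenly across the GPUs, all remaining memory available for the KV cache, no framework overhead):

```python
# Minimal sketch of the KV cache capacity estimate (Table 3).

def max_kv_cache_tokens(total_gpu_mem_gb: float,
                        n_params_billion: float,
                        kv_gib_per_token: float,
                        bytes_per_param: int = 2) -> int:
    weights_gb = n_params_billion * 1e9 * bytes_per_param / 1e9
    remaining_gb = total_gpu_mem_gb - weights_gb
    if remaining_gb <= 0:
        return 0  # OOM: the GPUs cannot even hold the model weights
    return int(remaining_gb / kv_gib_per_token)

KV_8B = 2 * 2 * 32 * 4096 / 1024**3   # Llama-3-8B,  ~0.00049 GiB/token
KV_70B = 2 * 2 * 80 * 8192 / 1024**3  # Llama-3-70B, ~0.00244 GiB/token

print(max_kv_cache_tokens(48, 8, KV_8B))        # 1x L40 (48 GB):   ~65536 tokens
print(max_kv_cache_tokens(2 * 80, 70, KV_70B))  # 2x H100 (80 GB):  ~8192 tokens
```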

Once you know the maximum number of tokens that fit in the KV cache, you can estimate the maximum number of concurrent requests each LLM can handle, assuming every request uses the model’s longest acceptable prompt. This represents the worst-case scenario, as it’s unlikely that all concurrent users will utilize the maximum context length. However, this calculation provides a useful upper bound, helping you understand the limitations of your underlying system.

Concurrent_requests_with_the_longest_prompt_size
    = Max_tokens_in_KV_cache ÷ maximum_context_window
    = 65536 tokens ÷ 8192 tokens per request
    = 8 concurrent prompts for Llama-3-8B on a single L40
| Model | Max Context Window | 1× 48G L40/L40s | 1× 80G A100/H100 | 2× 48G L40/L40s | 2× 80G A100/H100 | 4× 48G L40/L40s | 4× 80G A100/H100 | 8× 48G L40/L40s | 8× 80G A100/H100 |
|-------|--------------------|-----------------|------------------|-----------------|------------------|-----------------|------------------|-----------------|------------------|
| Llama-3-8B | 8192 | 8 | 16 | 20 | 36 | 44 | 76 | 92 | 156 |
| Llama-3-70B | 8192 | OOM | OOM | OOM | 1 | 3 | 9 | 12 | 25 |
| Llama-3.1-8B | 131072 | 1 | 1 | 1 | 2 | 3 | 5 | 6 | 10 |
| Llama-3.1-70B | 131072 | OOM | OOM | OOM | 0 | 0 | 1 | 1 | 2 |
| Mistral-7B-v0.3 | 32768 | 2 | 4 | 5 | 9 | 11 | 19 | 23 | 39 |
| Mixtral-8x7B-v0.1 | 32768 | OOM | OOM | OOM | 2 | 3 | 7 | 9 | 17 |

Table 4. Max concurrent requests when using the largest available context window for a prompt

In real-world use cases, average prompt sizes are often shorter, resulting in higher concurrent throughput. For instance, if the average context window is 4096 tokens, you can expect a significant increase in concurrent requests compared to using the maximum acceptable prompt length.

| Model | Context Window | 1× 48G L40/L40s | 1× 80G A100/H100 | 2× 48G L40/L40s | 2× 80G A100/H100 | 4× 48G L40/L40s | 4× 80G A100/H100 | 8× 48G L40/L40s | 8× 80G A100/H100 |
|-------|----------------|-----------------|------------------|-----------------|------------------|-----------------|------------------|-----------------|------------------|
| Llama-3-8B | 4096 | 16 | 32 | 40 | 72 | 88 | 152 | 184 | 312 |
| Llama-3-70B | 4096 | OOM | OOM | OOM | 2 | 5 | 18 | 24 | 50 |
| Llama-3.1-8B | 4096 | 16 | 32 | 40 | 72 | 88 | 152 | 184 | 312 |
| Llama-3.1-70B | 4096 | OOM | OOM | OOM | 2 | 5 | 18 | 24 | 50 |
| Mistral-7B-v0.3 | 4096 | 17 | 33 | 41 | 73 | 89 | 153 | 185 | 313 |
| Mixtral-8x7B-v0.1 | 4096 | OOM | OOM | OOM | 17 | 25 | 57 | 73 | 137 |

Table 5. Max concurrent requests when using an average context window of 4096 tokens
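
The concurrency numbers in Tables 4 and 5 follow directly from the token budget. A minimal sketch (again with our own function names, not those of the calculator script):

```python
# Minimal sketch of the concurrency estimate: divide the KV cache token
# budget (Table 3) by the context window consumed per request.

def max_concurrent_requests(kv_cache_tokens: int, context_window: int) -> int:
    return kv_cache_tokens // context_window

# Llama-3-8B on a single 48 GB L40 (65,536 KV cache tokens)
print(max_concurrent_requests(65536, 8192))  # 8 requests at the full 8K context window
print(max_concurrent_requests(65536, 4096))  # 16 requests at a 4K average context window
```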

Calculate Load for Multiple Model Replicas

When calculating the load for multiple replicas of a model, use the concurrency number for the smallest GPU configuration that can hold the model (the first non-OOM column in the tables above), and scale it by the number of replicas.

For instance, consider the case where you want to deploy two replicas of the Llama-3-8B model on 2x L40 GPUs, with load balancing between them. In this case, each L40 GPU contains a whole-weight copy of the Llama-3-8B model, rather than distributing the weights among GPUs.

To calculate the total number of concurrent requests, multiply the number of replicas by the number of concurrent prompts supported by each replica. For example, assuming an average context window of 4096 tokens (the 4K column in Table 5), the calculation would be:

2 replicas × 16 concurrent prompts per replica = 32 concurrent requests

Calculate Estimated Latency

The total time to solution, which is the time it takes to receive a response, consists of two main components:

  • Prefill Time: The time it takes to process the input prompt, which includes the time to populate the key-value (KV) cache.
  • Token Generation Time: The time it takes to generate each new token in the response.

Note: The calculations presented below provide simplified estimates of inference time, which should be considered as lower bounds (or best estimates), as they do not account for various additional factors that can impact performance, such as GPU communication costs, network delays, and software stack overhead. In practice, these factors can increase inference time by up to 50% or more, depending on the specific system configuration and workload. 

Prefill_time_per_token on each GPU

For the prefill phase, we assume that all prompt tokens are batched into a single forward pass. For simplicity, we also assume the limiting bottleneck is compute, not memory. We can then use the following formula to estimate the prefill time (in milliseconds) required to process a single token of a prompt.

Prefill_time_per_token
    = 2 FLOP per parameter × number of the LLM's parameters per GPU ÷ accelerator FP16 compute throughput
    = (2 × 8B) FLOP ÷ 181 TFLOP/s
    = 0.088 ms for Llama-3-8B on L40

When a model is deployed with parallelism (e.g., tensor parallelism), the weights are evenly distributed across multiple GPUs, so we use the portion of parameters held by each GPU in the estimate. In Table 6, for the 70B and 8x7B models, we show the minimum number of GPUs required to hold them. Specifically, for the 8x7B model, we use the 13 billion “active” parameters of its two experts for the estimate.

| GPU | 7B | 8B | 70B | Min GPUs (70B) | 8x7B | Min GPUs (8x7B) |
|-----|----|----|-----|----------------|------|-----------------|
| A10 | 0.112 | 0.128 | 0.160 | 7 | 0.208 | 5 |
| A30 | 0.042 | 0.048 | 0.061 | 7 | 0.079 | 5 |
| L40 | 0.077 | 0.088 | 0.193 | 4 | 0.144 | 3 |
| L40s | 0.039 | 0.044 | 0.097 | 4 | 0.072 | 3 |
| A100 40 GB | 0.045 | 0.051 | 0.112 | 4 | 0.083 | 3 |
| A100 40 GB SXM | 0.045 | 0.051 | 0.112 | 4 | 0.083 | 3 |
| A100 80 GB PCIe | 0.045 | 0.051 | 0.224 | 2 | 0.083 | 2 |
| A100 80 GB SXM | 0.045 | 0.051 | 0.224 | 2 | 0.083 | 2 |
| H100 PCIe | 0.009 | 0.011 | 0.046 | 2 | 0.017 | 2 |
| H100 SXM | 0.007 | 0.008 | 0.035 | 2 | 0.013 | 2 |
| H100 NVL | 0.004 | 0.004 | 0.018 | 2 | 0.007 | 2 |

Table 6. Prefill time (ms) per prompt token for different model sizes on each GPU
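
The entries in Table 6 can be reproduced with a few lines of Python. The sketch below assumes roughly 2 FLOPs per parameter per prompt token and ignores communication and software overhead, as discussed above:

```python
# Minimal sketch of the prefill-time estimate in Table 6.

def prefill_time_per_token_ms(params_billion_per_gpu: float, fp16_tflops: float) -> float:
    flops_per_token = 2 * params_billion_per_gpu * 1e9   # ~2 FLOPs per parameter
    return flops_per_token / (fp16_tflops * 1e12) * 1e3  # seconds -> milliseconds

print(prefill_time_per_token_ms(8, 181))       # Llama-3-8B on one L40:        ~0.088 ms
print(prefill_time_per_token_ms(70 / 4, 181))  # Llama-3-70B across four L40s: ~0.193 ms
```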

Generation_time_per_token on each GPU

The autoregressive part of generation is memory-bound [3]. We can estimate the time (in milliseconds) required to generate a single token in the response by the following formula. This metric is also known as Time Per Output Token (TPOT).

Generation_time_per_token
    = number of bytes moved (the model weights) per GPU ÷ accelerator memory bandwidth
    = (2 Bytes × 8B params) ÷ 864 GB/s
    = 18.5 ms for Llama-3-8B on L40

Similarly in Table 7, we use the portion of parameters allocated to each GPU to estimate sizing and performance for 70B and 8x7B models.

| GPU | 7B | 8B | 70B | Min GPUs (70B) | 8x7B | Min GPUs (8x7B) |
|-----|----|----|-----|----------------|------|-----------------|
| A10 | 23.3 | 26.7 | 33.3 | 7 | 43.3 | 5 |
| A30 | 15.0 | 17.1 | 21.4 | 7 | 27.9 | 5 |
| L40 | 16.2 | 18.5 | 40.5 | 4 | 30.1 | 3 |
| L40s | 16.2 | 18.5 | 40.5 | 4 | 30.1 | 3 |
| A100 40 GB | 9.0 | 10.3 | 22.5 | 4 | 16.7 | 3 |
| A100 40 GB SXM | 9.0 | 10.3 | 22.5 | 4 | 16.7 | 3 |
| A100 80 GB PCIe | 7.2 | 8.3 | 36.2 | 2 | 13.4 | 2 |
| A100 80 GB SXM | 6.9 | 7.8 | 34.3 | 2 | 12.8 | 2 |
| H100 PCIe | 7.0 | 8.0 | 35.0 | 2 | 13.0 | 2 |
| H100 SXM | 4.2 | 4.8 | 20.9 | 2 | 7.8 | 2 |
| H100 NVL | 1.8 | 2.1 | 9.0 | 2 | 3.3 | 2 |

Table 7. Generation_time_per_token (ms) for different model sizes on each GPU
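
Table 7 follows the same pattern as Table 6 but divides by memory bandwidth instead of compute throughput. A minimal sketch under the same assumptions:

```python
# Minimal sketch of the generation (decode) time estimate in Table 7:
# every output token requires reading the weights held by each GPU.

def generation_time_per_token_ms(params_billion_per_gpu: float,
                                 mem_bandwidth_gb_s: float,
                                 bytes_per_param: int = 2) -> float:
    bytes_moved = params_billion_per_gpu * 1e9 * bytes_per_param
    return bytes_moved / (mem_bandwidth_gb_s * 1e9) * 1e3  # seconds -> milliseconds

print(generation_time_per_token_ms(8, 864))       # Llama-3-8B on one L40:        ~18.5 ms
print(generation_time_per_token_ms(70 / 4, 864))  # Llama-3-70B across four L40s: ~40.5 ms
```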

Prefill_time_per_token and Generation_time_per_token are useful for estimating whether a configuration meets your requirements for time-to-first-token (TTFT), tokens per second (TPS), and inter-token latency (ITL).

To deliver a seamless user experience in chat-type applications, it’s essential to achieve a time-to-first-token (TTFT) below the average human visual reaction time of 200 milliseconds. This metric is also affected by other factors, such as network speed, prompt length, and model size, so keep them in mind when interpreting the estimates for your own infrastructure.

A fast TTFT is only half the equation; a high tokens-per-second (TPS) rate is equally important for real-time applications like chat. Typically, a TPS of 30 or higher is recommended, which corresponds to an ITL of around 33.3 milliseconds when using streaming output. To put this into perspective, consider average human reading speed, estimated at 200-300 words per minute, with exceptional readers reaching up to 1,000 words per minute. In comparison, a model generating 30 tokens per second (roughly 22 words per second, assuming about 0.75 words per token) produces around 1,350 words per minute, which is significantly faster than even the fastest human readers.
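
For reference, here is the quick arithmetic behind those figures, assuming roughly 0.75 words per token (a common rule of thumb, not an exact conversion):

```python
tps = 30                            # target tokens per second
itl_ms = 1000 / tps                 # inter-token latency: ~33.3 ms
words_per_minute = tps * 0.75 * 60  # ~1,350 words per minute
print(itl_ms, words_per_minute)
```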

Total latency of a request

We can calculate the estimated total time for a single prompt and its response by the following formula.

Estimated response time
    = prefill_time + generation_time
    = prompt_size × prefill_time_per_token + response_size × generation_time_per_token
    = 4000 tokens × 0.088 ms + 256 tokens × 18.5 ms
    ≈ 5.1 s for Llama-3-8B on L40
| GPU | 7B | 8B | 70B | Min GPUs (70B) | 8x7B | Min GPUs (8x7B) |
|-----|----|----|-----|----------------|------|-----------------|
| A10 | 6.4 | 7.3 | 9.2 | 7 | 11.9 | 5 |
| A30 | 4.0 | 4.6 | 5.7 | 7 | 7.4 | 5 |
| L40 | 4.5 | 5.1 | 11.1 | 4 | 8.3 | 3 |
| L40s | 4.3 | 4.9 | 10.8 | 4 | 8.0 | 3 |
| A100 40 GB | 2.5 | 2.8 | 6.2 | 4 | 4.6 | 3 |
| A100 40 GB SXM | 2.5 | 2.8 | 6.2 | 4 | 4.6 | 3 |
| A100 80 GB PCIe | 2.0 | 2.3 | 10.2 | 2 | 3.8 | 2 |
| A100 80 GB SXM | 1.9 | 2.2 | 9.7 | 2 | 3.6 | 2 |
| H100 PCIe | 1.8 | 2.1 | 9.1 | 2 | 3.4 | 2 |
| H100 SXM | 1.1 | 1.3 | 5.5 | 2 | 2.0 | 2 |
| H100 NVL | 0.5 | 0.5 | 2.4 | 2 | 0.9 | 2 |

Table 8. Estimated latency (s) with prompt_size = 4000 and response_size = 256
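
Putting the two per-token times together gives the end-to-end estimate in Table 8. A minimal sketch, using the Llama-3-8B on L40 values derived above:

```python
# Minimal sketch of the end-to-end latency estimate in Table 8.

def response_time_s(prompt_size: int, response_size: int,
                    prefill_ms_per_token: float, generation_ms_per_token: float) -> float:
    prefill_ms = prompt_size * prefill_ms_per_token
    generation_ms = response_size * generation_ms_per_token
    return (prefill_ms + generation_ms) / 1e3

# Llama-3-8B on a single L40: 4,000-token prompt, 256-token response -> ~5.1 s
print(response_time_s(4000, 256, 0.088, 18.5))
```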

Use the Estimation Calculator Script

Now that you have a solid understanding of the key factors that impact LLM inference performance, you can leverage the provided Python script to estimate the memory footprint, capacity, and latency on VMware Private AI Foundation with NVIDIA.
To get started, simply update the input variables in the script to reflect your desired configuration:

  • num_gpu: Specify the number of GPUs you plan to use for your deployment.
  • prompt_sz: Define the average size of the input prompts you expect to process.
  • response_sz: Set the average size of the responses you expect to generate.
  • n_concurrent_req: Indicate the number of concurrent requests you anticipate handling.
  • avg_context_window: Specify the average context window size for your use case.

By modifying these variables, you can easily estimate the performance characteristics of your LLM deployment and make informed decisions about your infrastructure requirements.

✗ python LLM_size_pef_calculator.py -g 1 -p 4096 -r 256 -c 10 -w 1024
 num_gpu = 1, prompt_size = 4096 tokens, response_size = 256 tokens
 n_concurrent_request = 10, avg_context_window = 1024 tokens

******************** Estimate LLM Memory Footprint ********************
| Model           | KV Cache Size per Token   | Memory Footprint   |
|-----------------+---------------------------+--------------------|
| Llama-3-8B      | 0.000488 GiB/token        | 21.00 GB           |
| Llama-3-70B     | 0.002441 GiB/token        | 165.00 GB          |
| Llama-3.1-8B    | 0.000488 GiB/token        | 21.00 GB           |
| Llama-3.1-70B   | 0.002441 GiB/token        | 165.00 GB          |
| Mistral-7B-v0.3 | 0.000488 GiB/token        | 19.00 GB           |

******************** Estimate LLM Capacity and Latency ********************
| Model           | GPU             | KV Cache Tokens   | Prefill Time   | Generation Time   | Estimated Response Time   |
|-----------------+-----------------+-------------------+----------------+-------------------+---------------------------|
| Llama-3-8B      | A10             | 16384.0           | 0.128 ms       | 26.667 ms         | 7.4 s                     |
| Llama-3-8B      | A30             | 16384.0           | 0.048 ms       | 17.149 ms         | 4.6 s                     |
| Llama-3-8B      | L40             | 65536.0           | 0.088 ms       | 18.519 ms         | 5.1 s                     |
| Llama-3-8B      | L40s            | 65536.0           | 0.044 ms       | 18.519 ms         | 4.9 s                     |
| Llama-3-8B      | A100 40 GB      | 49152.0           | 0.051 ms       | 10.289 ms         | 2.8 s                     |
| Llama-3-8B      | A100 40 GB SXM  | 49152.0           | 0.051 ms       | 10.289 ms         | 2.8 s                     |
| Llama-3-8B      | A100 80 GB PCIe | 131072.0          | 0.051 ms       | 8.269 ms          | 2.3 s                     |
| Llama-3-8B      | A100 80 GB SXM  | 131072.0          | 0.051 ms       | 7.847 ms          | 2.2 s                     |
| Llama-3-8B      | H100 PCIe       | 131072.0          | 0.011 ms       | 8.000 ms          | 2.1 s                     |
| Llama-3-8B      | H100 SXM        | 131072.0          | 0.008 ms       | 4.776 ms          | 1.3 s                     |
| Llama-3-8B      | H100 NVL        | 352256.0          | 0.004 ms       | 2.051 ms          | 0.5 s                     |
| Llama-3-70B     | A10             | OOM               | 1.120 ms       | 233.333 ms        | 64.3 s                    |
| Llama-3-70B     | A30             | OOM               | 0.424 ms       | 150.054 ms        | 40.2 s                    |
| Llama-3-70B     | L40             | OOM               | 0.773 ms       | 162.037 ms        | 44.6 s                    |
| Llama-3-70B     | L40s            | OOM               | 0.387 ms       | 162.037 ms        | 43.1 s                    |
| Llama-3-70B     | A100 40 GB      | OOM               | 0.449 ms       | 90.032 ms         | 24.9 s                    |
| Llama-3-70B     | A100 40 GB SXM  | OOM               | 0.449 ms       | 90.032 ms         | 24.9 s                    |
| Llama-3-70B     | A100 80 GB PCIe | OOM               | 0.449 ms       | 72.351 ms         | 20.4 s                    |
| Llama-3-70B     | A100 80 GB SXM  | OOM               | 0.449 ms       | 68.661 ms         | 19.4 s                    |
| Llama-3-70B     | H100 PCIe       | OOM               | 0.093 ms       | 70.000 ms         | 18.3 s                    |
| Llama-3-70B     | H100 SXM        | OOM               | 0.071 ms       | 41.791 ms         | 11.0 s                    |
| Llama-3-70B     | H100 NVL        | 19660.8           | 0.035 ms       | 17.949 ms         | 4.7 s                     |

Conclusion

In this blog post, we have provided a comprehensive guide to help you plan and deploy Large Language Models (LLMs) on VMware Private AI Foundation with NVIDIA. We have discussed the key factors that impact LLM inference performance, including GPU specifications and model specifications. We have also provided a set of formulas, tables, and a Python script to help you estimate the memory footprint, capacity, and latency of your LLM deployment based on your requirements. We hope this blog post helps guide you in delivering a seamless user experience in chat-type applications on VMware Private AI Foundation with NVIDIA.

Reference

[1] VMware Private AI Foundation with NVIDIA 

[2] A guide to LLM inference and performance

[3] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

[4] Efficient Memory Management for Large Language Model Serving with PagedAttention

[5] Mixtral 8x7B

Acknowledgments

The author thanks Ramesh Radhakrishnan, Frank Denneman, Rick Battle, Jessiely Consolacion, Shobhit Bhutani, and Roger Fortier from Broadcom’s VMware Cloud Foundation Division for reviewing and improving the paper.

