
Tag Archives: vsphere

Performance of SQL Server 2017 for Linux VMs on vSphere 6.5

Microsoft SQL Server has long been one of the most popular applications to run on vSphere virtual machines.  Last year there was quite a bit of excitement when Microsoft announced it was bringing SQL Server to Linux, and interest has remained strong since.  At Microsoft Ignite last month, Microsoft announced that SQL Server 2017 for Linux is now officially launched and generally available.

VMware and Microsoft have collaborated to validate and support the functionality and performance scalability of SQL Server 2017 on vSphere-based Linux VMs.  The results of that work show that SQL Server 2017 for Linux installs easily and performs well inside VMware vSphere virtual machines, making vSphere a convenient environment for trying out the new Linux version of SQL Server without giving up performance.

Using CDB, a cloud database benchmark developed by the Microsoft SQL Server team, we were able to verify that the performance of SQL Server 2017 for Linux in a vSphere virtual machine was similar to its performance on other virtualized and non-virtualized operating systems and platforms.

Our initial reference test was relatively small, so we also tested larger sizes to see how well SQL Server 2017 for Linux performed as the VM was scaled up.  For these tests, we used a four-socket Intel Xeon E7-8890 v4 (Broadwell)-based server with 96 cores (24 cores per socket).  The initial test began with a 24 virtual CPU (vCPU) VM to match the number of physical cores in a single socket.  We then increased the VM size by 24 vCPUs for each subsequent test until, in the final test, the VM had 96 vCPUs.  We configured the virtual machine with 512 GB of RAM and separate log and data disks on an SSD-based Fibre Channel SAN.  We followed the same best practices for SQL Server for Linux that we normally use for the Windows version, as documented in our published best practices guide for SQL Server on vSphere.
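As a quick sanity check before running a benchmark, a SQL Server 2017 for Linux instance can be queried from a client exactly like the Windows version. The snippet below is a minimal sketch using Python and the pyodbc driver; the server name, port, credentials, and ODBC driver version are placeholders rather than values from our test environment.

    # Minimal sketch: confirm a SQL Server 2017 for Linux instance answers queries.
    # Server name, credentials, and ODBC driver version are placeholders.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sql-linux-vm.example.com,1433;"
        "DATABASE=master;UID=sa;PWD=YourStrongPassword1"
    )
    cursor = conn.cursor()
    cursor.execute("SELECT @@VERSION")
    print(cursor.fetchone()[0])   # should report SQL Server 2017 ... on Linux
    conn.close()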

The results showed that SQL Server 2017 for Linux scaled very well as the additional vCPUs were added to the virtual machine. SQL Server 2017 for Linux is capable of scaling up to handle very large databases on VMware vSphere 6.5 Linux virtual machines.

Skylake Update – Oracle Database Performance on vSphere 6.5 Monster Virtual Machines

We were able to get one of the new four-socket Intel Skylake-based servers and run some more tests. Specifically, we used Xeon Platinum 8180 processors with 28 cores each. The new data has been added to the Oracle Monster Virtual Machine Performance on VMware vSphere 6.5 whitepaper. Please check out the paper for the full details and context of these updates.

The generational testing in the paper now includes a fifth generation: a 112 vCPU virtual machine running on the Skylake-based server. The performance gain from the initial 40 vCPU VM on Westmere-EX to the 112 vCPU VM on Skylake is almost 4x.

The Hyper-Threading measurement was also updated and shows a 27% performance gain from the use of Hyper-Threads. The test was conducted by running two 112 vCPU VMs at the same time so that all 224 logical threads were active; the total throughput of the two VMs was then compared with the throughput of a single VM.
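To make the arithmetic behind that comparison explicit, here is a tiny sketch with hypothetical throughput numbers (the measured values are in the paper):

    # Hypothetical numbers to illustrate how the Hyper-Threading gain is computed.
    single_vm_throughput = 100.0          # one 112 vCPU VM using only physical cores
    two_vm_throughput = 63.5 + 63.5       # two 112 vCPU VMs keeping all 224 logical threads busy

    gain = (two_vm_throughput / single_vm_throughput) - 1.0
    print("Hyper-Threading gain: {:.0%}".format(gain))   # 27% with these made-up numbers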

My colleague David Morse has also updated his SQL Server monster virtual machine whitepaper with Skylake data.

Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning

In our third episode of machine learning performance with vSphere 6.x, we look at the virtual GPU vs. the physical GPU. In addition, we extend the performance comparison of machine learning workloads using VMware DirectPath I/O (passthrough) vs. NVIDIA GRID vGPU, which was partially addressed in previous episodes.

Machine Learning with Virtualized GPUs

Performance is one of the biggest concerns that keeps high performance computing (HPC) users from choosing virtualization to deploy HPC applications, despite benefits such as reduced administration costs, efficient resource utilization, energy savings, and security. However, with the constant evolution of virtualization technologies, the performance gap between bare metal and virtualization has almost disappeared, and in some use cases virtualized applications achieve better performance than bare metal because of the intelligent, highly optimized resource management of the hypervisor. For example, a prior study [1] shows that vector machine applications running on a virtualized cluster of 10 servers have a better execution time than running on bare metal.

Virtual GPU vs. Physical GPU

To understand the performance impact of machine learning with GPUs under virtualization, we used a complex language modeling application that predicts the next word given a history of previous words, using a recurrent neural network (RNN) with 1500 Long Short-Term Memory (LSTM) units per layer on the Penn Treebank (PTB) dataset [2, 3], which has:

  • 929,000 training words
  • 73,000 validation words
  • 82,000 test words
  • 10,000 vocabulary words
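For readers who want to see what such a model looks like in code, below is a minimal sketch, written with TensorFlow's Keras API, of a PTB-style word-level language model with two stacked layers of 1500 LSTM units. It is not the benchmark implementation, and the unrolled sequence length is an assumption.

    # Sketch only: PTB-style next-word prediction with stacked 1500-unit LSTM layers.
    import numpy as np
    import tensorflow as tf

    VOCAB_SIZE = 10000    # PTB vocabulary size
    HIDDEN_SIZE = 1500    # LSTM units per layer
    NUM_STEPS = 35        # unrolled sequence length (assumed)

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, HIDDEN_SIZE, input_length=NUM_STEPS),
        tf.keras.layers.LSTM(HIDDEN_SIZE, return_sequences=True),
        tf.keras.layers.LSTM(HIDDEN_SIZE, return_sequences=True),
        tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),  # next-word distribution
    ])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

    dummy_batch = np.random.randint(0, VOCAB_SIZE, size=(2, NUM_STEPS))
    print(model(dummy_batch).shape)   # (2, 35, 10000): a word distribution per position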

We tested three cases:

  • A physical GPU installed on bare metal (this is the “native” configuration)
  • A DirectPath I/O GPU inside a VM on vSphere 6
  • A GRID vGPU (that is, an M60-8Q vGPU profile with 8GB memory) inside a VM on vSphere 6

The VM in the last two cases has 12 virtual CPUs (vCPUs), 60GB RAM, and 96GB SSD storage.

The benchmark was implemented using TensorFlow [4], which we also used to implement the other machine learning benchmarks in our experiments. We used CUDA 7.5, cuDNN 5.1, and CentOS 7.2 for both the native and guest operating systems. These test cases were run on a Dell PowerEdge R730 server with two 12-core Intel Xeon E5-2680 v3 processors at 2.50 GHz (24 physical cores, 48 logical threads with hyper-threading enabled), 768 GB of memory, and a 1.5 TB SSD. The server also had two NVIDIA Tesla M60 cards (each with two GPUs) for a total of 4 GPUs; each GPU had 2048 CUDA cores and 8GB of memory, supported 36 H.264 1080p30 video streams, and could be divided into 1–32 GRID vGPUs with memory profiles ranging from 512MB to 8GB. This experimental setup was used for all tests presented in this blog (Figure 1, below).

Figure 1. Testbed configurations for native GPU vs. virtual GPU comparison
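One practical note: inside the guest OS, a DirectPath I/O GPU and a GRID vGPU both appear as an ordinary CUDA device, so the same TensorFlow code runs unmodified in all three configurations. The snippet below is a simple, hedged way to confirm the framework sees the device; the exact device description will vary by setup.

    # Sketch: list the devices TensorFlow can see inside the VM (or on bare metal).
    from tensorflow.python.client import device_lib

    for dev in device_lib.list_local_devices():
        # Expect a /device:GPU:0 entry backed by the Tesla M60 (passthrough or M60-8Q vGPU).
        print(dev.name, dev.device_type, dev.physical_device_desc)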

The results in Figure 2 (below) show the relative execution times of DirectPath I/O and GRID vGPU compared to the native GPU. Virtualization introduces only about a 4% overhead, and the performance of DirectPath I/O and GRID vGPU is similar. These results are consistent with prior studies of GPU passthrough performance, where the overhead in most cases is less than 5% [5, 6].

Figure 2. DirectPath I/O and NVIDIA GRID vs. native GPU

GPU vs. CPU in a Virtualization Environment

One important benefit of using GPUs is shorter training times for machine learning tasks, which has accelerated AI research and development in recent years; in many cases, execution times drop from weeks or days to hours or minutes. We illustrate this benefit in Figure 3 (below), which shows the training time with and without a vGPU for two applications:

  • RNN with PTB (described earlier)
  • CNN with MNIST: a handwriting recognizer that uses a convolutional neural network (CNN) on the MNIST dataset [7].

From the results, we see that the training time for RNN on PTB with the CPU was 7.9 times longer than the vGPU training time (Figure 3-a), and the training time for CNN on MNIST with the CPU was 10.1 times longer than the vGPU training time (Figure 3-b). The VM used in this test has 1 vGPU, 12 vCPUs, 60 GB memory, and 96 GB of SSD storage, and the test setup is otherwise similar to that of the experiment above.

Figure 3. Normalized training time of PTB, MNIST with and without vGPU

As the test results show, we can successfully run machine learning applications in a vSphere 6 virtualized environment, and their training times are similar to those of machine learning applications running in a native (non-virtualized) configuration using physical GPUs.

But what about a passthrough scenario? How does a machine learning application run in a vSphere 6 virtual machine using a passthrough to the physical GPU vs. using a virtualized GPU? We present our findings in the next section.

Comparison of DirectPath I/O and GRID vGPU

We evaluate the performance, scalability, and other benefits of DirectPath I/O and GRID vGPU. We also provide some recommendations on the best use cases for each virtual GPU solution.

Performance

To compare the performance of DirectPath I/O and GRID vGPU, we benchmarked them with RNN on PTB, and CNN on MNIST and CIFAR-10. CIFAR-10 [8] is an object classification application that categorizes RGB images of 32×32 pixels into 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. MNIST is a handwriting recognition application. Both CIFAR-10 and MNIST use a convolutional neural network. The language model, described earlier, predicts the next word from the preceding words using a recurrent neural network on the Penn Treebank (PTB) dataset.

Figure 4. Performance comparison of DirectPath I/O and GRID vGPU

The results in Figure 4 (above) show the comparative performance of the two virtualization solutions, in which DirectPath I/O achieves slightly better performance than GRID vGPU. This is because the passthrough mechanism of DirectPath I/O adds minimal overhead to GPU-based workloads running inside a VM. In Figure 4-a, DirectPath I/O is about 5% faster than GRID vGPU for MNIST, and they have the same performance with PTB. For CIFAR-10, DirectPath I/O can process about 13% more images per second than GRID vGPU. We use images per second for CIFAR-10 because it is a frequently used metric for this dataset. The VM in this experiment has 12 vCPUs, 60GB RAM, and one GPU (either DirectPath I/O or GRID vGPU).
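To make the CNN benchmarks and the images/sec metric concrete, here is a minimal sketch of a small CIFAR-10-style convolutional network together with an images-per-second measurement; the layer sizes, batch size, and step count are illustrative assumptions, not the exact benchmark configuration.

    # Sketch only: small CIFAR-10-style CNN plus an images/sec measurement.
    import time
    import numpy as np
    import tensorflow as tf

    BATCH, STEPS = 128, 100
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu",
                               input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(384, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),   # the 10 CIFAR-10 classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    x = np.random.rand(BATCH * STEPS, 32, 32, 3).astype("float32")   # stand-in images
    y = np.random.randint(0, 10, size=BATCH * STEPS)                 # stand-in labels

    start = time.time()
    model.fit(x, y, batch_size=BATCH, epochs=1, verbose=0)
    print("images/sec: %.1f" % ((BATCH * STEPS) / (time.time() - start)))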

Scalability

We look at two types of scalability: user and GPU.

User Scalability

In a cloud environment, multiple users can share physical servers, which helps to better utilize resources and save cost. Our test server with 4 GPUs can support up to 4 users who each need a GPU; alternatively, a single user can run four VMs, each with one vGPU.  The number of virtual machines per host in a cloud environment is typically high in order to increase utilization and lower costs [9]; machine learning workloads, however, are much more resource intensive, which is why our 4-GPU test system serves at most 4 such users.

Figure 5. Scaling the number of VMs with vGPU on CIFAR-10

Figure 5 (above) presents user scalability on CIFAR-10 from 1 to 4 users, where each user runs a VM with one GPU, and we normalize images per second to the DirectPath I/O 1-VM case (Figure 5-a).  Similar to the previous comparison, DirectPath I/O and GRID vGPU show comparable performance as the number of VMs with GPUs scales. Specifically, the performance difference between them is 6%–10% for images per second and 0%–1.5% for CPU utilization. This difference is not significant when weighed against the benefits that vGPU brings; because of its flexibility and elasticity, it is a good option for machine learning workloads. The results also show that the two solutions scale linearly with the number of VMs, both in terms of execution time and CPU resource utilization. The VMs used in this experiment have 12 vCPUs, 16GB memory, and 1 GPU (either DirectPath I/O or GRID vGPU).

GPU Scalability

For machine learning applications that need to build very large models or in which the datasets cannot fit into a single GPU, users can use multiple GPUs to distribute the workloads among them and speed up the training task further. On vSphere, applications that require multiple GPUs can use DirectPath I/O passthrough to configure VMs with as many GPUs as required. This capability is limited for CUDA applications using GRID vGPU because only 1 vGPU per VM is allowed for CUDA computations.

We demonstrate the efficiency of using multiple GPUs on vSphere by benchmarking the CIFAR-10 workload, using images per second (images/sec) to compare the performance of a VM configured with 1 to 4 GPUs.

From the results in Figure 6 (below), we found that the number of images processed per second improves almost linearly with the number of GPUs on the host (Figure 6-a). At the same time, CPU utilization also increases linearly (Figure 6-b). This result shows that machine learning workloads scale well on the vSphere platform. For machine learning applications that require more GPUs than a single physical server can support, we can use a distributed computing model, with multiple distributed processes using GPUs running on a cluster of physical servers. With this approach, both DirectPath I/O and GRID vGPU can be used to enhance scalability with a very large number of GPUs.

Figure 6. Scaling the number of GPUs per VM on CIFAR-10
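As a rough illustration of the multi-GPU configuration just described, the sketch below uses TensorFlow's data-parallel MirroredStrategy to run one replica of a model on every GPU the VM can see and average the resulting gradients. This is a modern stand-in for the manual multi-tower code of that era, not the benchmark implementation.

    # Sketch: data-parallel training across all GPUs exposed to the VM
    # (DirectPath I/O can pass several physical GPUs through to one VM).
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()             # one replica per visible GPU
    print("replicas in sync:", strategy.num_replicas_in_sync)

    with strategy.scope():                                   # variables are mirrored on every GPU
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(64, 3, activation="relu", input_shape=(32, 32, 3)),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit(...) now splits each global batch across the GPUs and averages the gradients.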

How to Choose Between DirectPath I/O and GRID vGPU

For DirectPath I/O

From the above results, we can see that DirectPath I/O and GRID vGPU have similar performance and low overhead compared to the performance of native GPU, which makes both good choices for machine learning applications in virtualized cloud environments. For applications that require short training times and use multiple GPUs to speed up machine learning tasks, DirectPath I/O is a suitable option because this solution supports multiple GPUs per VM. In addition, DirectPath I/O supports a wider range of GPU devices, and so can provide a more flexible choice of GPU for users.

For GRID vGPU

When each user needs a single GPU, GRID vGPU can be a good choice. This configuration provides a higher consolidation of virtual machines and leverages the benefits of virtualization:

  • GRID vGPU allows flexible use of the device because vGPU supports both shared GPU (multiple users per physical GPU) and dedicated GPU (one user per physical GPU). Mixing and switching among machine learning, 3D graphics, and video encoding/decoding workloads that use GPUs is much easier and allows more efficient use of the hardware. Using GRID solutions for machine learning and 3D graphics lets cloud-based services multiplex the GPUs among more concurrent users than there are physical GPUs in the system. This contrasts with DirectPath I/O, the dedicated GPU solution, where the number of concurrent users is limited to the number of physical GPUs.
  • GRID vGPU reduces administration cost because deploying and maintaining it does not require a server reboot, so no downtime is imposed on end users. For example, changing the vGPU profile of a virtual machine does not require a server reboot, whereas any change to a DirectPath I/O configuration does. GRID vGPU’s ease of management reduces the time and complexity of administering and maintaining the GPUs. This benefit is particularly important in a cloud environment, where the number of managed servers can be very large.

Conclusion

Our tests show that virtualized machine learning workloads on vSphere with vGPUs offer near bare-metal performance.

References

  1. Jaffe, D. Big Data Performance on vSphere 6. (August 2016). http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/bigdata-perf-vsphere6.pdf.
  2. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent Neural Network Regularization. In: arXiv:1409.2329 (2014).
  3. Taylor, A., Marcus, M., Santorini, B.: The Penn Treebank: An Overview. In: Abeille, A. (ed.). Treebanks: the state of the art in syntactically annotated corpora. Kluwer (2003).
  4. TensorFlow Homepage, https://www.tensorflow.org
  5. Vu, L., Sivaraman, H., Bidarkar, R.: GPU Virtualization for High Performance General Purpose Computing on the ESX hypervisor. In: Proc. of the 22nd High Performance Computing Symposium (2014).
  6. Walters, J.P., Younge, A.J., Kang, D.I., Yao, K.T., Kang, M., Crago, S.P., Fox, G.C.: GPU Passthrough Performance: A Comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL Applications. In: Proceedings of 2014 IEEE 7th International Conference on Cloud Computing (2014).
  7. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, 86(11):2278-2324 (November 1998).
  8. Multiple Layers of Features from Tiny Images, https://www.cs.toronto.edu/~kriz/cifar.html
  9. Pandey, A., Vu, L., Puthiyaveettil, V., Sivaraman, H., Kurkure, U., Bappanadu, A.: An Automation Framework for Benchmarking and Optimizing Performance of Remote Desktops in the Cloud. In: Proceedings of the 2017 International Conference on High Performance Computing & Simulation (to appear, 2017).

Updated – SQL Server VM Performance with vSphere 6.5, October 2017

Back in March, I published a study of SQL Server performance with vSphere 6.5 across multiple processor generations.  Since then, Intel has released a brand-new processor architecture: the Xeon Scalable platform, formerly known as Skylake.

Our team was fortunate enough to get early access to a server with these new processors inside – just in time for generating data that we presented to customers at VMworld 2017.

Each Xeon Platinum 8180 processor has 28 physical cores (pCores), and with four processors in the server, there were a whopping 112 pCores on one physical host!  As you can see, that extra horsepower provides nice database server performance scaling:

Generational SQL Server VM Database Performance

For more details and the test results, take a look at the updated paper:
Performance Characterization of Microsoft SQL Server on VMware vSphere 6.5

Introducing DRS Dump Insight

In an effort to provide a more insightful user experience and to help understand how vSphere DRS works, we recently released a fling: DRS Dump Insight.

DRS Dump Insight is a service portal where users can upload drmdump files and receive a summary of the DRS run, with a breakdown of all the possible moves along with the changes in ESXi host resource consumption before and after the DRS run.

Users can get answers to questions like:

  • Why did DRS make a certain recommendation?
  • Why is DRS not making any recommendations to balance my cluster?
  • What recommendations did DRS drop due to cost/benefit analysis?
  • Can I get all the recommendations made by DRS?


DRS Lens – A new UI dashboard for DRS

DRS Lens provides an alternative UI for a DRS-enabled cluster. It gives a simple, yet powerful interface to monitor the cluster in real time and provides useful analyses to users. The UI is composed of different dashboards in the form of tabs for each cluster being monitored.


Oracle Database Performance on vSphere 6.5 Monster Virtual Machines

We have just published a new whitepaper on the performance of Oracle databases on vSphere 6.5 monster virtual machines. We took a look at the performance of the largest virtual machines possible on the previous four generations of four-socket Intel-based servers. The results show how performance of these large virtual machines continues to scale with the increases and improvements in server hardware.

Oracle Database Monster VM Performance on vSphere 6.5 across 4 generations of Intel-based four-socket servers

In addition to vSphere 6.5 and the four-socket Intel-based servers used in the testing, an IBM FlashSystem A9000 high performance all-flash array was used. This array provided extremely low latency, which enabled the database virtual machines to achieve such high levels of performance.

Please read the full paper, Oracle Monster Virtual Machine Performance on VMware vSphere 6.5, for details on hardware, software, test setup, results, and more cool graphs.  The paper also covers the performance gain from Hyper-Threading, the performance effect of NUMA, and best practices for Oracle monster virtual machines. Because these best practices focus on monster virtual machines, we also recommend checking out the full Oracle Databases on VMware Best Practices Guide.

Some similar tests with Microsoft SQL Server monster virtual machines were also recently completed on vSphere 6.5 by my colleague David Morse. Please see his blog post and whitepaper for the full details.

This work on Oracle is in some ways a follow-up to Project Capstone from 2015 and the resulting whitepaper Peeking at the Future with Giant Monster Virtual Machines. That project dealt with monster VM performance from a slightly different angle and may be of interest to readers of this paper and its results.

 

SQL Server VM Performance with VMware vSphere 6.5

Achieving optimal SQL Server performance on vSphere has been a constant focus here at VMware; I’ve published past performance studies with vSphere 5.5 and 6.0 which showed excellent performance up to the maximum VM size supported at the time.

Since then, there have been quite a few changes!  While this study uses a similar test methodology, it features an updated hypervisor (vSphere 6.5), database engine (SQL Server 2016), OLTP benchmark (DVD Store 3), and CPUs (Intel Xeon v4 processors with 24 cores per socket, codenamed Broadwell-EX).


Machine Learning on vSphere 6 with Nvidia GPUs – Episode 2

by Hari Sivaraman, Uday Kurkure, and Lan Vu

In a previous blog [1], we looked at how machine learning workloads (MNIST and CIFAR-10) using TensorFlow, running in vSphere 6 VMs in an NVIDIA GRID configuration, reduced training time from hours to minutes when compared to the same system running with no virtual GPUs.

Here, we extend our study to multiple workloads—3D CAD and machine learning—run at the same time vs. run independently on the same vSphere server.

This is episode 2 of a series of blogs on machine learning with vSphere. Also see:


Understanding vSphere DRS Performance – A White Paper

VMware vSphere Distributed Resource Scheduler (DRS) is responsible for placement of virtual machines and balancing of resources in a cluster. The key driver for DRS is VM/application happiness, which it achieves through effective VM placement and efficient load balancing. We have published a new white paper that explains how DRS works in basic scenarios and how it can be tuned to behave differently for specific scenarios.

The white paper talks about the factors that influence DRS decisions and provides some useful insights into different parameters that can be tuned in specific scenarios to make DRS more effective. It also explains how to monitor DRS to better understand its behavior.

It covers DRS behavior in specific scenarios with some case studies. These studies include:

  • VM consumed vs. active memory, and how it impacts DRS behavior
  • Impact of VM overrides on cluster balance
  • Prerequisite moves during initial placement
  • Using shares to prioritize cluster resources

The paper explains the factors that affect DRS behavior and helps you understand how DRS does what it does. This knowledge, along with the monitoring and troubleshooting tips and real case studies, will help you tune DRS clusters for optimum performance.