Troubleshooting Storage Performance in vSphere – Part 1 – The Basics

I frequently present at VMware User Group (VMUG) meetings, VMworld, and partner conferences. If you have ever attended one of my talks, you will know it is like trying to drink from a fire hose; it is hard to cover everything in a 45-minute session. So I will take the time here to write a few blog posts that go over the concepts from these talks in more detail (or at least more slowly). One of the most popular, yet very fast-paced, talks I present is Troubleshooting Storage Performance in vSphere. I'll slow things down a bit and discuss each topic here. This might be just a review for some of you, but as we get into more detail I hope there will be some new nuggets of VMware-specific information that can help even the more advanced storage folks.

Today’s post is just the basics.  What is bad storage performance and where do I measure it?

Poor storage performance is generally the result of high I/O latency. vCenter or esxtop will report the latencies at each level in the storage stack, from the VM down to the storage hardware. vCenter cannot report the actual latency seen by the application, because that includes latency in the Guest OS and in the application itself, and neither is visible to vCenter. vCenter can report on the following storage stack I/O latencies in vSphere:
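
As a minimal sketch of what "measuring at each level" can look like once the numbers are collected, the snippet below averages per-layer latency samples from a CSV export. The short column names (`gavg_ms`, `kavg_ms`, `qavg_ms`, `davg_ms`), the sample values, and the `summarize_latency` helper are all hypothetical and chosen for illustration. A real esxtop batch-mode capture (`esxtop -b`) uses much longer counter names, so the headers would need to be mapped first.

```python
import csv
import io
from statistics import mean

# Hypothetical export: one row per sampling interval, all values in milliseconds.
# These headers and numbers are illustrative, not real esxtop output.
SAMPLE_CSV = """gavg_ms,kavg_ms,qavg_ms,davg_ms
6.1,0.1,0.0,6.0
7.4,0.2,0.1,7.2
5.9,0.1,0.0,5.8
"""


def summarize_latency(csv_text):
    """Return the mean of every column, in ms, rounded to two decimals."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {
        name: round(mean(float(row[name]) for row in rows), 2)
        for name in rows[0]
    }


summary = summarize_latency(SAMPLE_CSV)
for layer, value in summary.items():
    print(f"{layer}: {value} ms")
```

Averaging hides short spikes, so a summary like this is only a starting point for finding which layer to look at more closely.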

[Figure: Storage Stack Components in a vSphere environment]

GAVG (Guest Average Latency): total latency as seen from vSphere.

KAVG (Kernel Average Latency): time an I/O request spent waiting inside the vSphere storage stack.

QAVG (Queue Average Latency): time spent waiting in a queue inside the vSphere storage stack.

DAVG (Device Average Latency): latency coming from the physical hardware, HBA, and storage device.

To provide some rough guidance: for most application workloads (typically 8k I/O size, 80% random, 80% read), we generally say that anything greater than 20 to 30 ms of I/O latency may be a performance concern. Of course, as with all things performance related, some applications are more sensitive to I/O latency than others, so the 20 to 30 ms figure is rough guidance rather than a hard rule. So we expect that GAVG, the total latency as seen from vCenter, should be less than 20 to 30 ms. As seen in the picture, GAVG is made up of KAVG and DAVG. Ideally, all our I/O should get out onto the wire quickly and spend no significant amount of time sitting in the vSphere storage stack, so we would like to see KAVG very low. As a rough guideline, KAVG should usually be 0 ms, and anything greater than 2 ms may be an indicator of a performance issue.
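
Since GAVG is made up of KAVG and DAVG, a quick way to see where the time is going is to split the total into its two parts. The sketch below does that arithmetic. The function name `latency_breakdown` and the example values are hypothetical and only illustrate the relationship.

```python
def latency_breakdown(kavg_ms, davg_ms):
    """Return GAVG (KAVG + DAVG) and the share of it spent in each layer."""
    gavg_ms = kavg_ms + davg_ms
    if gavg_ms == 0:
        # No measurable latency, so neither layer has a meaningful share.
        return {"gavg_ms": 0.0, "kernel_share": 0.0, "device_share": 0.0}
    return {
        "gavg_ms": gavg_ms,
        "kernel_share": kavg_ms / gavg_ms,
        "device_share": davg_ms / gavg_ms,
    }


# Illustrative values: 3 ms in the kernel, 9 ms at the device.
result = latency_breakdown(kavg_ms=3.0, davg_ms=9.0)
print(result)
```

In this made-up case the total of 12 ms is well under the 20 to 30 ms guidance, yet a quarter of it is spent in the vSphere storage stack and KAVG is above the 2 ms guideline, so the kernel layer would still be worth a look.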

So what are the rule-of-thumb indicators of bad storage performance?

• High Device Latency: Device Average Latency (DAVG) consistently greater than 20 to 30 ms may cause a performance problem for your typical application.

• High Kernel Latency: Kernel Average Latency (KAVG) should usually be 0 ms in an ideal environment, but anything greater than 2 ms may be a performance problem.
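
The two rules of thumb above can be sketched as a simple check over a series of samples. The 20 ms and 2 ms limits come from the guidance in this post. Everything else is an assumption for illustration: the function names, the sample data, and in particular the choice to treat "consistently" as "at least 80% of samples over the limit".

```python
DAVG_LIMIT_MS = 20.0      # lower end of the 20 to 30 ms guidance
KAVG_LIMIT_MS = 2.0       # kernel latency guidance
SUSTAINED_FRACTION = 0.8  # assumption: "consistently" means 80% of samples


def is_sustained(samples_ms, limit_ms, fraction=SUSTAINED_FRACTION):
    """True if at least `fraction` of the samples exceed `limit_ms`."""
    if not samples_ms:
        return False
    over = sum(1 for sample in samples_ms if sample > limit_ms)
    return over / len(samples_ms) >= fraction


def check_storage_latency(davg_samples_ms, kavg_samples_ms):
    """Return a list of rule-of-thumb findings for the given samples."""
    findings = []
    if is_sustained(davg_samples_ms, DAVG_LIMIT_MS):
        findings.append("high device latency (DAVG)")
    if is_sustained(kavg_samples_ms, KAVG_LIMIT_MS):
        findings.append("high kernel latency (KAVG)")
    return findings


# Made-up data: a short spike versus latency that stays high.
spiky_davg = [8, 9, 95, 110, 7, 8, 9, 8, 7, 9]
steady_davg = [34, 41, 38, 29, 45, 33, 37, 40, 36, 31]
quiet_kavg = [0.1] * 10

print(check_storage_latency(spiky_davg, quiet_kavg))
print(check_storage_latency(steady_davg, quiet_kavg))
```

With these samples, the brief spike is not flagged while the steadily high series is. The 80% cutoff is arbitrary and should be tuned to how sensitive the application is to I/O latency.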

So what can cause bad storage performance, and how do you address it? Well, that is for next time…

And as a side note: check out your local VMUG (VMware User Group). The VMUG community has more than 75,000 members, with more than 180 local groups across 32 countries. Many local VMUGs hold free user conferences that are a great opportunity to learn from in-depth technical sessions, demonstrations, and exhibits, and to network with other VMware customers and partners. That is where I'll be this week, presenting Storage Troubleshooting and Performance Best Practices at the Denver VMUG. Check out the VMUGs and maybe I'll see you at one in your area. http://www.vmug.com

Continue to Troubleshooting Storage Performance – Part 2:
http://blogs.vmware.com/vsphere/2012/06/troubleshooting-storage-performance-in-vsphere-part-2.html

6 thoughts on “Troubleshooting Storage Performance in vSphere – Part 1 – The Basics”

  1. Thanks Loren, I’ll provide some NFS specific guidance a bit later on in the Storage Performance Troubleshooting Series, but the general recommendation applies. If you see latencies on your NFS Datastore greater than 20 to 30ms then that may be causing a performance impact in your environment.

  2. The first rule of thumb states “Device Average Latency (DAVG) consistently greater than 20 to 30 ms”. I am assuming this means I/O latency that stays above 20 to 30 ms. Is it normal to have 3-5 minute periods of 50 to 125 ms peaks in latency due to high I/O applications on VMs across multiple LUNs?

  3. Well as always “it depends”, but in general yes, an occasional 3-5 minute period of higher than 20-30ms latency would not be a big concern. However you would probably want to monitor the situation and try and determine what is causing the spike and if possible try and move the spike to a non-production time of the day. For instance if the VM was doing a backup or virus check which was causing the spike then those could easily be moved to a time of the day that would not impact the performance of your environment as much.

  4. Thanks for sharing. I’ve seen KAVG around 2-3 ms at a global bank. Luckily it was due to storage. But if the storage team were to say the array was doing well, it would have been difficult. KAVG is more of a clue than root cause.
    How do we detect whether the high KAVG was due to:
    - I/O Stack Queue congestion
    - Guest Level Driver and Queuing Interactions
    - Incorrectly Tuned Applications

    Thanks from Singapore.
