Advanced VMkernel Settings for Disk Storage

As regular readers will know by now, many of these blog posts are a result of internal discussions between me and other VMware folks (or indeed storage partners). This one is no different. I was recently involved in a discussion about how VMs do sequential I/O, which led me to point out a number of VMkernel parameters related to performance vs fairness for VM I/O. I have seen other postings about these parameters, but I realised that I had never posted anything about them myself.

A word of caution! These parameters have already been fine-tuned by VMware. There should be no need to modify them. If you do, you risk impacting your own environment. As mentioned, this is all about performance vs fairness. Tuning these values can give you some very fast VMs but can also give you some very slow ones. You've been warned.

Disk.SchedNumReqOutstanding
This is the maximum number of I/Os one VM can issue all the way down to the LUN when there is more than one VM pushing I/O to the same LUN. The default was 16 before ESX 3.5. It was bumped to 32 in ESX 3.5, and remains at 32 today.

Disk.SchedQuantum
The maximum number of consecutive “sequential” I/Os allowed from one VM before we force a switch to another VM (unless this is the only VM on the LUN). Disk.SchedQuantum is set to a default value of 8.
But how do we figure out if the next I/O is sequential or not? That's a good question.

Disk.SectorMaxDiff
As mentioned, we need a figure of ‘proximity’ to see if the next I/O of a VM is ‘sequential’. If it is, then we give the VM the benefit of getting the next I/O slot as it will likely be served faster by the storage. If it is outside this proximity, then the I/O goes to the next VM for fairness. This value is the maximum distance in disk sectors when considering if two I/Os are “sequential”. Disk.SectorMaxDiff defaults to 2000 sectors.
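
To make the interplay between Disk.SchedQuantum and Disk.SectorMaxDiff concrete, here is a minimal Python sketch of the decision described above. It is only an illustration, not VMkernel code: the function names are hypothetical, and it assumes that "distance" means the absolute difference between the sector numbers of two consecutive I/Os from the same VM.

```python
SCHED_QUANTUM = 8        # Disk.SchedQuantum default quoted in this post
SECTOR_MAX_DIFF = 2000   # Disk.SectorMaxDiff default quoted in this post


def is_sequential(prev_sector, next_sector, max_diff=SECTOR_MAX_DIFF):
    """Assumption: two I/Os are 'sequential' if they lie within max_diff sectors."""
    return abs(next_sector - prev_sector) <= max_diff


def keep_same_vm(prev_sector, next_sector, consecutive, other_vms_waiting,
                 quantum=SCHED_QUANTUM, max_diff=SECTOR_MAX_DIFF):
    """Return True if the same VM should get the next I/O slot.

    consecutive is the number of sequential I/Os this VM has already
    issued in a row on the LUN.
    """
    if not other_vms_waiting:
        return True   # only VM on the LUN, so there is nobody to be fair to
    if not is_sequential(prev_sector, next_sector, max_diff):
        return False  # outside the proximity window: next VM gets the slot
    return consecutive < quantum  # sequential, but only up to the quantum
```

For example, `keep_same_vm(1000, 1500, 3, True)` keeps the slot with the same VM, while `keep_same_vm(1000, 1500, 8, True)` forces a switch because the quantum of 8 has been used up.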

Disk.SchedQControlVMSwitches
This value is used to determine when to throttle down the number of I/Os sent by one VM to the queue. It refers to the number of times we switch between VMs to handle I/O – if we switch this many times, then we reduce the maximum number of commands that can be queued. The default is 6 switches.

Disk.SchedQControlSeqReqs
This is used to determine when to throttle back up to the full queue depth. It refers to the number of times we issue I/Os from the same VM before we go back to using the full LUN queue depth. The default is 128. In other words, if the same VM issues 128 I/Os without any other VM wishing to issue I/Os in the same timeframe, we throttle the number of I/Os per VM back to its maximum.
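
The throttle-down and throttle-up behaviour can be pictured as a small state machine. The sketch below is a toy model under stated assumptions, not the actual scheduler: the class name and the LUN queue depth of 64 are hypothetical, it assumes the throttled limit is Disk.SchedNumReqOutstanding, and it assumes the switch counter is reset once one VM has run uninterrupted for Disk.SchedQControlSeqReqs I/Os.

```python
class LunThrottle:
    """Toy model of per-VM queue-depth throttling on a single LUN."""

    def __init__(self, lun_queue_depth=64, num_req_outstanding=32,
                 vm_switches=6, seq_reqs=128):
        self.lun_queue_depth = lun_queue_depth          # hypothetical value
        self.num_req_outstanding = num_req_outstanding  # Disk.SchedNumReqOutstanding
        self.vm_switches = vm_switches                  # Disk.SchedQControlVMSwitches
        self.seq_reqs = seq_reqs                        # Disk.SchedQControlSeqReqs
        self.limit = lun_queue_depth  # start unthrottled
        self.last_vm = None
        self.switches = 0
        self.same_vm_run = 0

    def on_issue(self, vm):
        """Record one I/O issued by vm and return the current per-VM limit."""
        if self.last_vm is not None and vm != self.last_vm:
            # A different VM issued I/O: count the switch, restart the run.
            self.switches += 1
            self.same_vm_run = 1
            if self.switches >= self.vm_switches:
                self.limit = self.num_req_outstanding  # throttle down
        else:
            self.same_vm_run += 1
            if self.same_vm_run >= self.seq_reqs:
                self.limit = self.lun_queue_depth      # throttle back up
                self.switches = 0
        self.last_vm = vm
        return self.limit
```

In this model, two VMs alternating I/O hit the sixth switch on the seventh I/O and drop to a limit of 32. A single VM then has to issue I/O on its own until its uninterrupted run reaches 128 before the limit returns to the full queue depth.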

While researching this post, I came across a bunch of other advanced disk parameters in my notes which I thought you might like to know about.

Disk.PathEvalTime
The amount of time to wait before checking the status of a failed path. The default is 300 seconds (5 minutes). This means that if you have a preferred path (fixed path policy) and you have failed over to an alternate path, every 300 seconds the VMkernel will issue a TUR (Test Unit Ready) SCSI command to see if the preferred path has come back online. When it does, I/O will be moved back to the preferred path.
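
A practical consequence of the polling interval is that failback is not immediate once the preferred path recovers. The following sketch works through the arithmetic. It assumes, purely for illustration, that probes are issued at exact multiples of Disk.PathEvalTime counted from the moment of failover; the function name is hypothetical.

```python
import math

PATH_EVAL_TIME = 300  # Disk.PathEvalTime default, in seconds


def failback_time(path_restored_at, eval_interval=PATH_EVAL_TIME):
    """Seconds after failover at which a probe first finds the preferred path up.

    path_restored_at is when the preferred path recovers, in seconds after
    failover. Probes are assumed to occur at eval_interval, 2 * eval_interval, ...
    """
    probes = max(1, math.ceil(path_restored_at / eval_interval))
    return probes * eval_interval
```

Under that assumption, a path that recovers 10 seconds after failover is only noticed at the 300 second probe, and one that recovers at 301 seconds has to wait until the 600 second probe.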

Disk.SupportSparseLUN
Wow – this setting brings me back. Let's say that the SAN administrator presented LUNs 0, 1, 2 and 4, 5, 6 to your ESXi host. If Disk.SupportSparseLUN is turned off, then when we hit the gap in the LUN numbering (LUN 3 here), we stop and do not find any LUNs beyond that point. Having Disk.SupportSparseLUN enabled (which it is by default) means that we can traverse these gaps in LUNs. I'm pretty sure this is only relevant to the SCSI Bus Walking discovery method; see the next advanced setting.

Disk.UseReportLUN
The storage stack uses the SCSI REPORT LUNS command to detect LUNs on a target. REPORT LUNS asks the target to return a logical unit inventory (LUN list) to the initiator in a single request, rather than having the initiator query each LUN individually, which is what SCSI Bus Walking does. The option is enabled by default. Believe me, you do not want to use SCSI bus walking unless you get a kick out of having a really slow ESXi boot time.
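
The difference between the two discovery methods, and the effect of Disk.SupportSparseLUN on bus walking, can be shown with a small simulation. This is a sketch only: the function names and the scan ceiling of 255 are hypothetical, and each "probe" stands in for one per-LUN query to the target.

```python
MAX_LUN_ID = 255  # hypothetical scan ceiling, chosen for illustration


def bus_walk(presented, support_sparse_lun=True, max_lun=MAX_LUN_ID):
    """Probe LUN IDs one at a time; return (LUNs found, probes issued)."""
    found = []
    probes = 0
    for lun_id in range(max_lun + 1):
        probes += 1
        if lun_id in presented:
            found.append(lun_id)
        elif not support_sparse_lun:
            break  # first gap ends the scan when sparse LUN support is off
    return found, probes


def report_luns(presented):
    """The target returns its whole LUN inventory in a single request."""
    return sorted(presented), 1


presented = {0, 1, 2, 4, 5, 6}
print(bus_walk(presented, support_sparse_lun=False))  # stops at the gap at LUN 3
print(bus_walk(presented, support_sparse_lun=True))   # finds all six, many probes
print(report_luns(presented))                         # finds all six, one request
```

With the LUN layout from the Disk.SupportSparseLUN example, bus walking without sparse support finds only LUNs 0 to 2, bus walking with sparse support finds everything but probes every ID up to the ceiling, and the report-based method gets the full list in one request.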

Disk.UseDeviceReset & Disk.UseLUNReset
These two parameters, taken together, determine the type of SCSI reset. The following table shows the available types:

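[Table image "Reset-table" (reset type for each combination of Disk.UseDeviceReset and Disk.UseLUNReset) is not reproduced in this text version.]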
*The default is LUN Reset.

Get notification of these blog postings and more VMware Storage information by following me on Twitter: @VMwareStorage

This entry was posted in Storage, vSphere by Cormac Hogan.

About Cormac Hogan

Cormac Hogan is a senior technical marketing architect within the Cloud Infrastructure Product Marketing group at VMware. He is responsible for storage in general, with a focus on core VMware vSphere storage technologies and virtual storage, including the VMware vSphere® Storage Appliance. He has been in VMware since 2005 and in technical marketing since 2011.

5 thoughts on “Advanced VMkernel Settings for Disk Storage”

  1. Cormac, great post as usual! I have been told that leaving Disk.SupportSparseLUN enabled can contribute to rescan storms when you have lots of datastores. Reducing the number of SCSI ID slots being scanned, as well as disabling sparse LUN support, can help speed up rescan times. At the moment, we’re forced to shut off automatic rescans, then scan and create datastores on one host. Afterwards, we rescan the other members of the cluster one by one in order to avoid rescan storms which can cause VMs to hang.
    What are your thoughts on this?
    Johnny

  2. Hi Johnny,
    That would be true if we were still using the bus walking method. If you have all your disks in a contiguous range starting from 0, then once we meet the first empty position, we stop scanning.
    However, REPORT LUNS avoids this, as it requests the target SCSI layer to return a logical unit inventory (LUN list) to the initiator SCSI layer rather than querying each LUN individually.
    My understanding is that Disk.SupportSparseLUN doesn’t play a role when REPORT LUNS is used (and that has been the default since ESX 2.x, I think).

  3. Cormac, I believe that you may be the person who can solve a problem we have with VMware. We have a real-time application that we have run for years on sets of dedicated servers. One of our customers wanted us to try to run it in their virtual environment, adding new dedicated blades. Our application initially appeared to crash the VM, but it seems our application runs for a while and then blocks for some extended period of time (not good for real-time). I believe that the virtualization of the IO includes IO buffers that we are filling faster than they can be emptied; eventually they fill and we are then blocked until they can be flushed. Is there a way in VMware to NOT buffer IO at all, or to make an individual VM have such a small buffer that it continuously blocks for a short period of time?

    • Hi Dan,
      We don’t actually buffer I/O in the VMkernel. The reason for this is that the Guest OS will have a buffer cache, and once the Guest believes that it has committed a block to disk, we gotta make sure that block makes it all the way to persistent storage or we’ll end up in all sorts of problems when a host crashes.

      Anyway, this could be a number of things. One thought is that it could be related to the maximum I/O size (kb.vmware.com/kb/1003469) but that’s a shot in the dark. I’d urge you to open an SR with our support folks who can examine I/O behaviour via esxtop and give you a more educated response.
