
Tag Archives: SCSI

Which vSCSI controller should I choose for performance?

I wrote a blog article in Oct 2010 on this same topic that is still frequently referenced today, so I figured it was due an update.

So what should I choose as my vSCSI controller and what are the differences between them? Continue reading

Advanced VMkernel Settings for Disk Storage

As regular readers will know by now, many of these blog posts are a result of internal discussions held between myself and other VMware folks (or indeed storage partners). This one is no different. I was recently involved in a discussion about how VMs did sequential I/O, which led me to point out a number of VMkernel parameters related to performance vs fairness for VM I/O. In fact, I have seen other postings about these parameters, but I realised that I never did post anything myself. 

A word of caution! These parameters have already been fine-tuned by VMware. There should be no need to modify them. If you do, you risk impacting your own environment. As mentioned, this is all about performance vs fairness. Tuning these values can give you some very fast VMs, but it can also give you some very slow ones. You've been warned.

Disk.SchedNumReqOutstanding
This is the maximum number of I/Os one VM can issue all the way down to the LUN when there is more than one VM pushing I/O to the same LUN. The default was 16 prior to ESX 3.5; it was bumped to 32 in ESX 3.5 and remains at 32 today.

Disk.SchedQuantum
The maximum number of consecutive "sequential" I/Os allowed from one VM before we force a switch to another VM (unless this is the only VM on the LUN). Disk.SchedQuantum is set to a default value of 8.
But how do we figure out if the next I/O is sequential or not? That's a good question.

Disk.SectorMaxDiff
As mentioned, we need a measure of 'proximity' to decide if the next I/O of a VM is 'sequential'. If it is, then we give the VM the benefit of getting the next I/O slot, as it will likely be served faster by the storage. If it is outside this proximity, then the I/O goes to the next VM for fairness. This value is the maximum distance in disk sectors at which two I/Os are still considered "sequential". Disk.SectorMaxDiff defaults to 2000 sectors.

Disk.SchedQControlVMSwitches
This value is used to determine when to throttle down the number of I/Os sent by one VM to the queue. It refers to the number of times we switch between VMs to handle I/O – if we switch this many times, then we reduce the maximum number of commands that can be queued. The default is 6 switches.

Disk.SchedQControlSeqReqs
This is used to determine when to throttle back up to the full queue depth. It refers to the number of I/Os we issue from the same VM before we go back to using the full LUN queue depth. The default is 128. In other words, if the same VM issues 128 I/Os without any other VM wishing to issue I/Os in the same timeframe, we restore the LUN queue depth to its maximum.
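To make the interplay between Disk.SchedQuantum and Disk.SectorMaxDiff a little more concrete, here is a small Python sketch of the decision logic as described above. To be clear, this is just an illustration of the concept – it is not the actual VMkernel scheduler, which is far more involved.

```python
# Illustrative model of the VMkernel's performance-vs-fairness decision.
# The constants below are the documented defaults.
SCHED_QUANTUM = 8        # Disk.SchedQuantum: max consecutive "sequential" I/Os per VM
SECTOR_MAX_DIFF = 2000   # Disk.SectorMaxDiff: max sector distance to count as sequential

def is_sequential(last_sector, next_sector):
    """Two I/Os are 'sequential' if they land within SECTOR_MAX_DIFF sectors."""
    return abs(next_sector - last_sector) <= SECTOR_MAX_DIFF

def keep_vm_scheduled(consecutive_ios, last_sector, next_sector, other_vms_waiting):
    """Return True if the current VM gets the next I/O slot."""
    if not other_vms_waiting:
        return True                      # only VM on the LUN: no fairness concern
    if consecutive_ios >= SCHED_QUANTUM:
        return False                     # quantum used up: switch VMs for fairness
    return is_sequential(last_sector, next_sector)
```

So a VM doing I/O within 2000 sectors of its previous I/O keeps its slot until it has used its quantum of 8, at which point another VM gets a turn.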

While researching this post, I came across a bunch of other advanced disk parameters in my notes which I thought you might like to know about.

Disk.PathEvalTime
The amount of time to wait before checking the status of a failed path. The default is 300 seconds (5 minutes). This means that if you have a preferred path (fixed path policy) and you have failed over to an alternate path, every 300 seconds the VMkernel will issue a TUR (Test Unit Ready) SCSI command to see if the preferred path has come back online. When it does, I/O will be moved back to the preferred path.

Disk.SupportSparseLUN
Wow – this setting brings me back. Let's say that the SAN administrator presented LUNs 0, 1, 2 & 4, 5, 6 to your ESXi host. If Disk.SupportSparseLUN is turned off, then when we find the gap in the LUN numbering, we don't look for any LUNs beyond that point. Having Disk.SupportSparseLUN enabled (which it is by default) means that we can traverse these gaps in LUNs. I'm pretty sure this is only relevant to the SCSI bus-walking discovery method – see the next advanced setting.

Disk.UseReportLUN
The storage stack uses the SCSI REPORT LUNS command to detect LUNs on a target. REPORT LUNS asks a target to return its logical unit inventory (a LUN list) to the initiator, rather than querying each LUN individually, i.e. SCSI bus walking. The option is enabled by default. Believe me, you do not want to use SCSI bus walking unless you get a kick out of having a really slow ESXi boot time.
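To see why bus walking is so slow, here is a toy Python comparison of the two discovery methods, using the sparse LUN layout (0, 1, 2 & 4, 5, 6) from the previous setting. The LUN range and command counts are purely illustrative, not what the storage stack actually issues:

```python
# Toy comparison of LUN discovery strategies on a single target.
present_luns = {0, 1, 2, 4, 5, 6}   # a sparse layout, as in the example above
MAX_LUN = 255                        # range a bus walk would have to probe

def discover_report_luns():
    """One REPORT LUNS command returns the whole inventory."""
    return sorted(present_luns), 1   # (LUNs found, commands issued)

def discover_bus_walk(support_sparse=True):
    """Probe each LUN id in turn; without sparse-LUN support we stop at the
    first gap, so LUNs 4, 5 and 6 would never be found."""
    found, commands = [], 0
    for lun in range(MAX_LUN + 1):
        commands += 1
        if lun in present_luns:
            found.append(lun)
        elif not support_sparse:
            break
    return found, commands
```

One command versus hundreds per target – multiply that by every target on every adapter and you can see where the slow boot times come from.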

Disk.UseDeviceReset & Disk.UseLUNReset
These two parameters, taken together, determine the type of SCSI reset used:

  Disk.UseLUNReset = 1, Disk.UseDeviceReset = any  →  LUN Reset
  Disk.UseLUNReset = 0, Disk.UseDeviceReset = 1    →  Device (target) Reset
  Disk.UseLUNReset = 0, Disk.UseDeviceReset = 0    →  Bus Reset

*The default is LUN Reset.

Get notification of these blog postings and more VMware Storage information by following me on Twitter: @VMwareStorage

Configuration Settings for ALUA Devices

Posted by Cormac Hogan
Technical Marketing Manager (Storage)

If you've got an ALUA array, you've probably wondered what all those obscure configuration settings mean in the esxcli device listing. I certainly have.

Let me show you what I mean.

~ #  esxcli storage nmp device  list -d naa.xxx
   Device Display Name: DGC Fibre Channel Disk (naa.xxx)
   Storage Array Type: VMW_SATP_ALUA_CX
   Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on; explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}}
   Path Selection Policy: VMW_PSP_FIXED
   Path Selection Policy Device Config: {preferred=vmhba2:C0:T2:L100;current=vmhba2:C0:T2:L100}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba2:C0:T2:L100
   Is Local SAS Device: false
~ #

Now, that does seem like a lot of configuration options, doesn't it? Before describing what they are, we first need to know a little bit about ALUA. ALUA, or Asymmetric Logical Unit Access, describes an array where the access characteristics of one controller port presenting a LUN to a host differ from those of another controller port on the same array.

ALUA provides a way for a device to report the state of its ports to hosts. Hosts can then use this state to prioritize paths and to make failover and load-balancing decisions.

Let's delve a little deeper into possible ALUA characteristics.


Explicit/Implicit ALUA

ALUA devices can operate in two modes: implicit and/or explicit.

  • Explicit ALUA devices allow the host to use the "Set Target Port Groups" task management command to set the Target Port Group (TPG) state. We will look at target port groups & their various states shortly. This is not configurable, by the way, as it is an attribute of the device.
  • In implicit ALUA, a device's TPG state is managed by the target device itself. This is not configurable either, as again it is an attribute of the device.


Optimized/Non-Optimized Paths

ALUA is typically associated with Asymmetric Active-Active (what we will term AAA) arrays. In an AAA array, both controllers can receive I/O commands (active-active), but only one controller can issue I/O to the LUN. This is the asymmetric part. The opposite of an AAA array is a Symmetric Active-Active (SAA) array. SAA arrays can issue I/O to the LUN via both controllers.

The controller in an AAA array that can issue I/O to the LUN is called the managing (or owning) controller. Paths to the LUN via ports on this controller are called optimized paths. I/O sent to a port on the non-owning controller must be transferred internally to the owning controller, which increases latency and impacts the performance of the array. Paths to the LUN via the non-managing controller are therefore called non-optimized paths.


Target Port Group Support

TPGS provides a method for determining the access characteristics of a path to a LUN through a target port. It supports soliciting information about different target port capabilities and supports routing I/O to the particular port or ports that can achieve the best performance. TPGS is often used with arrays that handle load balancing and failover within the array controllers. Target Port Groups allow path grouping: each port in the same TPG has the same port state, which can be one of the following states: Active/Optimized, Active/Non-optimized, Standby, Unavailable, and In-Transition. A TPG is defined as a set of target ports that are in the same target port asymmetric access state at all times.

The ALUA SATP plugin sends RTPG (Report Target Port Groups) commands to the array to get the device server's TPG identifiers and states.


Follow-Over Algorithm & Path Thrashing

To understand the 'follow-over' algorithm, let's revisit a scenario which may result in path thrashing. Suppose a customer uses a fixed path policy on two hosts sharing a LUN from an active-passive array, and sets a preferred path to array controller A on ESX A and a preferred path to array controller B on ESX B. Each host will try to activate its preferred path whenever the path state changes. When host B activates its path, all I/Os from host A see an error indicating they are talking to a standby controller. Host A then tries to activate its path again, after which host B sees errors, and so on. Each time one ESX pulls the LUN back to its preferred controller, the other ESX just pulls it back again, and you are in a path-thrashing state.

Using the 'follow-over' algorithm means that if an ESX host sees the array controller of its 'preferred' path become inactive on the array, then it is for one of two reasons:
  1. Some other host lost access to the array controller of your preferred path.
  2. Some other host had their preferred path changed to use that array controller.

In ALUA, a host can activate a standby preferred path only if it caused a failover that initially put it into standby. A host just accepts that it cannot use its 'preferred' path in some cases. This is what "follow-over" means. You follow the lead of other hosts in certain cases of failovers or preferred path changes.

‘Follow-Over’ Algorithm Example

This is a bit complicated, so let's try to explain it again using a real-life example. Say host A loses access to array controller port B0 (which is currently active & preferred across all hosts). It must then select a new path. It may first try controller port B1, but if this is also dead, it might select port A0 and move all its I/O over to that path. Since we are switching I/O to a different controller, this will cause a trespass, where LUNs may have to move ownership from controller B to controller A. If controller B later becomes available, then host A can switch back to its original preferred path, since it was the cause of the original failover.

However other hosts, which had to move to controller A because of the actions taken by host A, cannot switch I/O back to controller B. Those other hosts do not know if host A can still access controller B (even if they can access it). But if host A can see the path to port B0 again, then it can try to switch back to the preferred path.

On the other hand, if it was another host that lost access to controller B and moved its active path to a port on controller A, this also forces host A to move its I/O to the new active path on controller A. In this case host A cannot pull the LUN back to its preferred path on controller B, even if it still has access to it. This is because host A did not initiate the failover that made the paths to controller B standby.
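If it helps, the follow-over rule in the two scenarios above boils down to something like this little Python sketch (a deliberate simplification of the real logic, of course):

```python
# Toy model of the 'follow-over' rule: a host may pull the LUN back to its
# preferred controller only if it was the host that caused the failover.
def may_fail_back(host, preferred_path_visible, failover_initiator):
    """Return True if 'host' is allowed to reactivate its preferred path."""
    if not preferred_path_visible:
        return False                   # can't use a path we cannot even see
    return failover_initiator == host  # follow-over: only the initiator fails back
```

Host A (which caused the trespass) may fail back once it sees port B0 again; every other host simply follows, even if it can still reach controller B.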


Phew! That's quite a bit of information. However, with all of the above in mind, we should now be able to get a good understanding of the settings in the esxcli output. Let's remind ourselves again about them:

Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on; explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}}

Let's skip navireg & ipfilter for the moment as they are not related to ALUA, and let's look at the other options.

  1. explicit_support: Shows whether or not the device supports explicit ALUA; this option cannot be set by the user as it is a property of the LUN.
  2. explicit_allow: Shows whether or not the user allows the SATP to exercise its explicit ALUA capability if the need arises during path failure. This only matters if the device actually supports explicit ALUA (i.e. explicit_support is 'on'). This option is turned on using 'enable_explicit_alua' and turned off using 'disable_explicit_alua' (esxcli command).
  3. alua_followover: Shows whether or not the user allows the SATP to exercise the 'follow-over' policy, which prevents path thrashing in multi-host set-ups. This option is turned on using 'enable_alua_followover' and turned off using 'disable_alua_followover' (esxcli command).
  4. There are two Target Port Groups. TPG_id=1 has all the Active Optimized (AO) paths, and TPG_id=2 has all the Active Non-Optimized (ANO) paths.
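If you find yourself staring at a lot of these config strings, here is a quick (and entirely unofficial) Python sketch that pulls the flags and the TPG entries out of the Storage Array Type Device Config line:

```python
import re

def parse_satp_config(cfg):
    """Pull the key=value flags and the TPG entries out of a
    'Storage Array Type Device Config' string from esxcli."""
    flags = dict(re.findall(r'(\w+)=(on|off)', cfg))
    tpgs = [{'id': int(i), 'state': s}
            for i, s in re.findall(r'TPG_id=(\d+),TPG_state=(\w+)', cfg)]
    return flags, tpgs

# The example config string from the esxcli output above.
cfg = ("{navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on; "
       "explicit_allow=on;alua_followover=on;"
       "{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}}")
flags, tpgs = parse_satp_config(cfg)
```

Running this against the example string gives you the six on/off flags in a dict plus the two TPGs with their AO/ANO states.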

A final note on the navireg and ipfilter entries. These are not related to ALUA, but are related to the specific SATP used. The more observant of you will have figured out that this is an EMC CX series storage array and the Storage Array Type is VMW_SATP_ALUA_CX. The Storage Array Type Device Config entries {navireg ipfilter} are specific to the VMW_SATP_ALUA_CX (as well as some other EMC SATPs). The accepted values are: navireg_on, navireg_off, ipfilter_on, ipfilter_off. To complete the post, I'll include some detail about those options for the more curious amongst you.

  • navireg_on starts automatic registration of the device with Navisphere
  • navireg_off stops the automatic registration of the device
  • ipfilter_on stops the sending of the host name for Navisphere registration; used if host is known as "localhost"
  • ipfilter_off  enables the sending of the host name during Navisphere registration

That completes the post. Hope those configuration settings make a bit more sense now.


How much storage can I present to a Virtual Machine?

This is an interesting question, and something which popped up in some recent discussions. It is really nothing more than a math exercise, considering that a VM can have 4 SCSI controllers and 15 devices per controller, but the numbers are still quite interesting. Let's look at what options you have. The following is true for all the virtual SCSI controllers that are supported on VMs at this time (LSI Logic, BusLogic & PVSCSI).

  1. VMDK (Virtual Machine Disks) approach. VMDKs have a maximum size of 2TB – 512 bytes. Maximum amount of storage that can be assigned to a VM using VMDKs is as follows: 4 controllers x 15 disks each x 2TB (-512 bytes) = ~120TB.
  2. Virtual (non pass-thru) RDMs approach. vRDMs also have a maximum size of 2TB – 512 bytes (same as VMDK). Therefore, the maximum amount of storage that can be assigned to a VM using vRDMs is as follows: 4 controllers x 15 disks each x (2TB – 512) = ~120TB
  3. Physical (pass-thru) RDMs approach. The maximum size of a pRDM since vSphere 5.0 is ~64TB. Hence, the maximum amount of storage that can be assigned to a VM using pRDMs (assuming vSphere 5.0) is as follows: 4 controllers x 15 disks each x 64TB = ~3.75PB
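You can verify the arithmetic for yourself; here is the whole math exercise in a few lines of Python:

```python
# Maximum storage presentable to a VM: 4 vSCSI controllers x 15 devices each.
TB = 2**40
PB = 2**50
slots = 4 * 15

vmdk_max = 2 * TB - 512    # max VMDK / virtual RDM size (2TB - 512 bytes)
prdm_max = 64 * TB         # max physical RDM size since vSphere 5.0

vmdk_total = slots * vmdk_max   # ≈ 120 TB for VMDKs or vRDMs
prdm_total = slots * prdm_max   # 3840 TB = 3.75 PB for pRDMs
```

60 slots of just under 2TB gives you a whisker under 120TB, and 60 slots of 64TB gives exactly 3840TB, i.e. 3.75PB.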

Of course, these are theoretical maximums & should be considered as such. I personally don't know of any customers who are close to this maximum size.

For completeness, there are some other options as well of course. I am aware of a number of customers implementing storage in a VM using the following techniques:

  1. iSCSI initiator in the VM (e.g. MS iSCSI Initiator – http://kb.vmware.com/kb/1010547). The amount of storage presented to a VM using this method will be down to initiator and Guest OS, but here disks can be passed directly into Guest OS. My understanding is that some of the Linux based iSCSI Initiators can have up to 1024 devices presented to them.
  2. NFS client in the VM (Linux/Unix Guest OS). Limitations on the amount of storage that can be presented to the VM will be down to NFS client and Guest OS, but shares/mounts can be passed directly into Guest OS.

Regarding RDMs (Raw Device Mappings), remember that you won’t be able to use a number of technologies if the customer goes with physical (pass-thru) RDMs, such as snapshots and the VADP APIs for backing up VMs. These technologies can still be used with virtual (non pass-thru) RDMs.

Do you have a VM with a very large storage footprint (100s of TB & greater)? I'd be interested in knowing how you've implemented it. Please send me a message in the comments.


How to configure ESXi to boot via Software iSCSI?


VMware introduced support for iSCSI back in the ESX 3.x days. However, ESX could only boot from an iSCSI LUN if a hardware iSCSI adapter was used; hosts could not boot via VMware's software iSCSI driver, even using a NIC with special iSCSI capabilities.

It quickly became clear that there was a need for booting via Software iSCSI. VMware's partners are developing blade chassis containing blade servers, storage and network interconnects in a single rack. The blades are typically disk-less, and in many cases have iSCSI storage. The requirement is to have the blade servers boot off of an iSCSI LUN using NICs with iSCSI capabilities, rather than using dedicated hardware iSCSI initiators.

In ESXi 4.1, VMware introduced support for booting the host from an iSCSI LUN via the Software iSCSI adapter. Note that support was introduced for ESXi only, and not classic ESX.

Check that the NIC is supported for iSCSI Boot

Much of the configuration for booting via Software iSCSI is done via the BIOS settings of the NICs and the host. Ensure that you are using a compatible NIC by checking the VMware HCL. This is important, but be aware: if you select a particular NIC and see iSCSI listed as a feature, you might assume that you are good to go with using it to boot. This is not the case.

To see if a particular NIC is supported for iSCSI boot, you need to set the I/O Device Type to Network (not iSCSI) and then check the footnotes. If the footnotes state that iBFT is supported, then this card may be used for boot from iSCSI (I'll explain iBFT later). Yes, this is all rather cryptic and difficult to follow in my opinion. I'm going to see if I can get this changed internally to make it a little more intuitive.

Steps to configure BIOS for Software iSCSI Boot

Now that you have verified that your NIC is supported, let's move on to the configuration steps. The first step is to go into the BIOS of the NIC and ensure that it is enabled for iSCSI boot. Here is how one would do it on an HP DL series:

Similarly, here is how you would do this on a DELL PowerEdge R710:


The next step is to get into the NIC configuration. In my testing I used a Broadcom NetXtreme NIC, which comes with a boot agent. Broadcom’s Multi-Boot Agent (MBA) software utility enables a host to execute a boot process using images from remote servers, including iSCSI targets. You access the MBA by typing <Control>S during the boot sequence:

 This takes us into the MBA Configuration Menu:


Select iSCSI as the boot protocol. The key sequence CTRL-K will allow you to access the iSCSI Configuration settings. If iSCSI isn't available as a boot protocol, it may mean that the iSCSI firmware has not been installed, or that iSCSI has not been enabled on the NIC. There are a number of different parameters to configure. The main menu lists all available parameters.

First select General Parameters. In this example, I am going to use static IP information, so I need to set Disabled for the TCP/IP parameters via DHCP and iSCSI parameters via DHCP parameters.

When doing the initial install, Boot to iSCSI Target must also be left Disabled. You will need to change it to Enabled for subsequent boots (I'll point out when, later in the post). You should therefore end up with settings similar to the following:

Press <Esc> to exit the General Parameters Configuration Screen and then select Initiator Parameters. At the iSCSI Initiator Parameters Configuration screen, one would enter values for the IP Address, Subnet Mask, Default Gateway, Primary DNS, and Secondary DNS parameters as needed. If authentication is required then enter the CHAP ID (Challenge Handshake Authentication Protocol) and CHAP Secret parameters.


Press <Esc> to return to the Main Menu and then select the 1st Target Parameters. Enter values for the Target IP Address, Target name, and Login information. The iSCSI Name corresponds to the iSCSI initiator name to be used by the client system. If authentication is required then enter the CHAP ID and CHAP Secret parameters. Note also that the Boot LUN ID (which LUN on the target we will use) is also selected here.
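One thing that often goes wrong at this screen is a mistyped iSCSI qualified name. As an illustration only, here is a rough Python sanity check for the iqn.yyyy-mm.reverse-domain[:identifier] format; it is deliberately loose, and the example names below are made up:

```python
import re

# Rough sanity check for iSCSI qualified names (IQNs), of the form
# iqn.yyyy-mm.reverse-domain[:identifier], as typically entered in the
# initiator/target parameter screens. Not a full RFC-level validator.
IQN_RE = re.compile(r'^iqn\.\d{4}-\d{2}\.[a-z0-9][a-z0-9.-]*(:.+)?$')

def looks_like_iqn(name):
    return bool(IQN_RE.match(name))
```

A check like this will catch the obvious slips (missing date field, missing iqn. prefix) before you burn a reboot cycle discovering the login fails.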


Press <Esc> to return to the Main Menu and then press <Esc> again to display the Exit Configuration screen and then select Exit and Save the Configuration. That completes the BIOS configuration. We are now ready to install ESXi onto an iSCSI LUN via the software iSCSI initiator.

Steps to install ESXi onto an iSCSI LUN via Software iSCSI

After configuring the MBA parameters in the Broadcom NIC, you can now go ahead with the ESXi installation. The install media for ESXi is placed in the CDROM as per normal. The next step is to ensure that Boot Controller/device order is set in the BIOS. For Broadcom cards, the NIC should be before the CDROM in the boot order.

When the host is powered on, the system BIOS loads a NIC's OptionROM code and starts executing. The NIC's OptionROM contains bootcode and iSCSI initiator firmware. The iSCSI initiator firmware establishes an iSCSI session with the target.

On boot, a successful login to the target should be observed before installation starts. In this example, the iSCSI LUN is on a NetApp Filer. If you get a failure at this point, you need to revisit the configuration steps done previously. Note that this screen doesn't appear for very long:


The installation now begins.

As part of the install process, what could best be described as a memory-only VMkernel is loaded. This needs to discover suitable LUNs for installation, one of which is the iSCSI LUN. However, for the VMkernel's iSCSI driver to communicate with the target, it needs the TCP/IP protocol to be set up. This is all done as part of one of the start-up init scripts. The NIC's OptionROM is also responsible for handing off the initiator and target configuration data to the VMkernel. The hand-off protocol is called iBFT (iSCSI Boot Firmware Table). Once the required networking is set up, an iSCSI session is established to the target configured in the iBFT, and LUNs beneath the target are discovered and registered with the VMkernel SCSI stack (PSA).

If everything is successful during the initial install, you will be offered the iSCSI LUN as a destination for the ESXi image, similar to the following:

You can now complete the ESXi installation as normal.

Steps to Boot ESXi from an iSCSI LUN via Software iSCSI

Once the install has been completed, a single iSCSI Configuration change is required in the iSCSI Configuration General Parameter. The change is to set the 'Boot to iSCSI target' to Enabled.


 Now you can reboot the host and it should boot ESXi from the iSCSI LUN via the software iSCSI initiator.


Finally, here are a few pointers if you run into trouble:

  1. Make sure your NIC is on the HCL for iSCSI boot. Remember to check the footnotes for the NIC.
  2. Make sure that your device has a firmware version that supports iSCSI boot.
  3. Make sure that the iSCSI configuration settings for initiator and target are valid.
  4. Check the login screen to make sure your initiator can log in to the target.
  5. Multipathing is not supported at boot, so ensure that the first target path is working.
  6. If you make changes to the physical network, these must be reflected in the iBFT.
  7. A new CLI command, esxcfg-swiscsi -b -q, displays the iBFT settings in the VMkernel.

