By Duncan Epping, Principal Architect.
Yesterday I wrote about HP/Lefthand joining the vSphere Metro Storage Cluster program. Today I want to inform you about something that I believe is rather important when implementing stretched cluster solutions. Hopefully all of you have read about the Permanent Device Loss (PDL) enhancements that were added to vSphere 5.0 Update 1. I wrote about it in depth in an upcoming white paper and in this blog post, and Cormac Hogan wrote an excellent article about it on the vSphere Storage Blog. In summary:
“The Permanent Device Loss condition is a condition that is communicated by the array to ESXi via a SCSI sense code. It indicates that a device (LUN) is unavailable and more than likely permanently unavailable.”
By issuing this sense code, the storage array informs ESXi of the status of the LUN, and action can then be taken if and when this is configured. In the case of vSphere 5.0 Update 1, two specific settings need to be set in order for ESXi and vSphere HA to respond to a PDL scenario.
The first setting is configured at the host level and is “disk.terminateVMOnPDLDefault”. This setting can be configured in /etc/vmware/settings and should be set to “True”. It ensures that a virtual machine is killed when the datastore it resides on is in a PDL state.
The second setting is a vSphere HA advanced setting called “das.maskCleanShutdownEnabled”. This setting is also not enabled by default and will need to be set to “True”. It allows HA to trigger a restart response for a virtual machine which has been killed automatically due to a PDL condition.
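To make the two requirements concrete, here is a minimal sketch of a check that both settings are in place. It assumes the host settings file contains simple key=value lines and that the HA advanced options are available as a dictionary; the function names and the way the data is obtained are hypothetical and are not part of any VMware tool or API.

```python
def parse_settings(text):
    """Parse key=value lines, skipping blanks and # comments (assumed format)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def is_true(value):
    """Treat any capitalisation of 'true' as enabled; missing means disabled."""
    return value is not None and value.strip().lower() == "true"


def pdl_response_configured(host_settings_text, ha_advanced_options):
    """Report whether each half of the PDL response is enabled.

    host_settings_text: contents of the host settings file (hypothetical input).
    ha_advanced_options: dict of HA advanced option name -> value (hypothetical input).
    """
    host = parse_settings(host_settings_text)
    return {
        "host_kills_vm_on_pdl": is_true(host.get("disk.terminateVMOnPDLDefault")),
        "ha_restarts_killed_vm": is_true(
            ha_advanced_options.get("das.maskCleanShutdownEnabled")
        ),
    }


if __name__ == "__main__":
    status = pdl_response_configured(
        "disk.terminateVMOnPDLDefault=True\n",
        {"das.maskCleanShutdownEnabled": "True"},
    )
    print(status)
```

Both values need to be true for the full kill-and-restart behaviour described above; either one alone only covers half of the response.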
Why am I calling this out specifically? Well, after exchanging some tweets and emails with HP I discovered that HP Lefthand does not issue a PDL (as we know it) but instead kills all iSCSI connections in the “losing site”. Let’s assume for a second that you have two sites with HP/Lefthand storage in both. If anything happens to the storage network between the sites, the storage system will give ownership of the LUN to the site you selected as the preferred site, or in HP terms the site you designated as the “primary site”. This means the losing site can no longer read from or write to that LUN; it is unavailable! If a PDL were issued, all VMs on the site that lost the connection would eventually be killed and automatically restarted by HA on the preferred site. However, as HP Lefthand does not issue a PDL, the VMs will not be killed by ESXi. Now here is the funny part. If you had set das.maskCleanShutdownEnabled to “True”, the VMs will be restarted on the primary site; if you did not, the VMs will probably not be restarted.
Imagine you have set das.maskCleanShutdownEnabled to “True”. What happens? For VMs that reside in the designated primary site, nothing happens. VMs that happen to reside in the “losing site” will be restarted on the primary site. However, they will not be killed in the losing site, meaning that you will have two identical VMs active on your network: one in Site-A and one in Site-B, although only one of them will have access to disk. Just imagine your users are not aware of this and are still accessing and working on the VM which no longer has access to disk. Yes, this could get ugly.
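The combinations discussed so far can be summarised in a small decision model. This is only a sketch of the behaviour as described in this post, not of how ESXi or vSphere HA is implemented; the function and field names are made up for illustration, and the one combination the post does not cover is returned as None rather than guessed.

```python
from typing import NamedTuple, Optional


class Outcome(NamedTuple):
    killed_on_losing_site: bool
    # None means this combination is not covered by the post.
    restarted_on_preferred_site: Optional[bool]


def losing_site_outcome(array_issues_pdl, terminate_vm_on_pdl, mask_clean_shutdown):
    """Model what happens to a VM running in the losing site.

    array_issues_pdl: the array sends a PDL sense code to ESXi.
    terminate_vm_on_pdl: disk.terminateVMOnPDLDefault is "True".
    mask_clean_shutdown: das.maskCleanShutdownEnabled is "True".
    """
    # ESXi only kills the VM when it actually receives a PDL.
    killed = array_issues_pdl and terminate_vm_on_pdl
    if array_issues_pdl and not killed:
        # PDL issued but the host setting is off: not discussed in the post.
        restarted = None
    else:
        # With the HA setting on, the VM is restarted; with it off,
        # it will "probably not" be restarted (modelled here as False).
        restarted = mask_clean_shutdown
    return Outcome(killed, restarted)


def has_duplicate_vm(outcome):
    """True when a second copy is started while the first is still running."""
    return (
        outcome.restarted_on_preferred_site is True
        and not outcome.killed_on_losing_site
    )


if __name__ == "__main__":
    scenarios = {
        "array issues PDL, both settings on": (True, True, True),
        "no PDL, HA setting on": (False, True, True),
        "no PDL, HA setting off": (False, True, False),
    }
    for label, args in scenarios.items():
        outcome = losing_site_outcome(*args)
        print(label, outcome, "duplicate VM:", has_duplicate_vm(outcome))
```

Only the middle scenario, where no PDL is issued but HA still restarts the VM, produces the two-copies situation described above.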
In this scenario DRS VM-Host affinity rules are key. Actually DRS VM-Host affinity rules are key in all vMSC implementations. Make sure your VM-Host affinity rules align with the “site affinity” / “preferred site” defined on the storage system for your datastores.
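As an illustration of what “aligned” means here, the following sketch compares the site each VM is pinned to by a VM-Host affinity rule with the preferred site of the datastore it lives on. The three dictionaries are hypothetical inventory data that you would have to collect yourself; nothing here calls a real vSphere or storage API.

```python
def misaligned_vms(vm_datastore, datastore_preferred_site, vm_host_rule_site):
    """Return (vm, datastore, reason) for every VM whose rule does not match.

    vm_datastore: VM name -> datastore name.
    datastore_preferred_site: datastore name -> preferred site on the array.
    vm_host_rule_site: VM name -> site of the hosts its affinity rule targets.
    """
    problems = []
    for vm, datastore in sorted(vm_datastore.items()):
        storage_site = datastore_preferred_site.get(datastore)
        rule_site = vm_host_rule_site.get(vm)
        if rule_site is None:
            problems.append((vm, datastore, "no VM-Host affinity rule"))
        elif rule_site != storage_site:
            problems.append(
                (vm, datastore,
                 f"rule pins to {rule_site}, storage prefers {storage_site}")
            )
    return problems


if __name__ == "__main__":
    # Example inventory; all names are made up.
    vm_datastore = {"vm-01": "ds-a", "vm-02": "ds-b", "vm-03": "ds-a"}
    datastore_preferred_site = {"ds-a": "Site-A", "ds-b": "Site-B"}
    vm_host_rule_site = {"vm-01": "Site-A", "vm-02": "Site-A"}
    for problem in misaligned_vms(
        vm_datastore, datastore_preferred_site, vm_host_rule_site
    ):
        print(problem)
```

Any VM reported by a check like this is one that could end up running in the losing site when the link between the sites fails.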
I hope this helps. If you have any questions, don’t hesitate to leave a comment.

Does VMware plan to provide an HCL of PDL-compliant storage arrays?
Not that I know of, Raphel. It is up to the individual storage vendor to provide those “implementation details”. I do understand it is an important part of it. I will ask internally what we will do with this.
I don’t think that this is correct. You only declare one site as primary in the case of a multi-site cluster without a failover manager. If it is configured like this and the inter-site link goes down, the primary site will keep the LUNs up, although there is no quorum. In any case, the LUNs will not remain online on the other site, so HA cannot start VMs there.
I know VMs cannot be restarted on the site which doesn’t have access to the volumes. But what if VMs were already running in this site? What if those VMs are now also restarted in the other site? You would have two copies of the VM running: one in Site A and one in Site B. Not a situation I would like to be in.
DRS affinity rules can help with that but are no guarantee. Hence some sort of PDL handling would be nice, but that isn’t possible with this type of iSCSI array at the moment.
I can confirm we have the same issue with a NetApp MetroCluster: it did not issue a PDL when we tested a site failure.
To compound the issue, most VMs still appear responsive, as I’m guessing they are running in memory. I’d be interested to know if there is any way of making the VM shut down in this scenario.
A half-hung VM confuses third-party products such as Oracle RAC (it doesn’t see the disk as hung) or F5 load balancers, which still send requests to it.
Unfortunately there is no real solution for that problem at this point, Mike. As NetApp is a uniform solution, one part of the solution will still be up, so no PDL is issued; some paths are simply reported as dead.
I have already spoken about this scenario with our engineering team and they are researching how to solve this in collaboration with the storage vendors.