

Best Practice: How to correctly remove a LUN from an ESX host

Yes, at first glance, you may be forgiven for thinking that this subject hardly warrants a blog post. But for those of you who have suffered the consequences of an All Paths Down (APD) condition, you'll know why this is so important.

Let's recap what APD actually is.

APD occurs when there are no longer any active paths to a storage device from the ESX host, yet the ESX host continues to try to access that device. When hostd tries to open a disk device, a number of commands, such as read capacity and reads to validate the partition table, are sent. If the device is in APD, these commands are retried until they time out. The problem is that hostd is responsible for a number of other tasks as well, not just opening devices. One such task is ESX-to-vCenter communication, and if hostd is blocked waiting for a device to open, it may not respond to these other tasks in a timely fashion. One consequence is that you might observe your ESX hosts disconnecting from vCenter.

We have made a number of improvements to how we handle APD conditions over the last few releases, but prevention is better than cure, so I wanted to use this post to highlight once again the best practices for removing a LUN from an ESX host and avoiding APD:

ESX/ESXi 4.1

Improvements in 4.1 mean that hostd now checks whether a VMFS datastore is accessible or not before issuing I/Os to it. This is an improvement, but it doesn't help with I/Os that are already in-flight when an APD occurs. The best practices for removing a LUN from an ESX 4.1 host, as described in detail in KB 1029786, are as follows (a command-line sketch of the masking and cleanup steps appears after the list):

  1. Unregister all objects from the datastore including VMs and Templates
  2. Ensure that no 3rd party tools are accessing the datastore
  3. Ensure that no vSphere features, such as Storage I/O Control, are using the device
  4. Mask the LUN from the ESX host by creating new rules in the PSA (Pluggable Storage Architecture)
  5. Physically unpresent the LUN from the ESX host using the appropriate array tools
  6. Rescan the SAN
  7. Clean up the rules created earlier to mask the LUN
  8. Unclaim any paths left over after the LUN has been removed
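
For reference, a rough command-line sketch of steps 4, 7 and 8 using the 4.1 esxcli namespace might look like the following. The rule number, adapter, channel, target and LUN values are placeholders for illustration only; follow KB 1029786 for the exact procedure and substitute your own identifiers:

  # Mask the path vmhba2:C0:T0:L4 (placeholder values) from the host
  esxcli corestorage claimrule list
  esxcli corestorage claimrule add -r 200 -t location -A vmhba2 -C 0 -T 0 -L 4 -P MASK_PATH
  esxcli corestorage claimrule load
  # Unclaim the existing path so the MASK_PATH rule takes effect
  esxcli corestorage claiming unclaim -t location -A vmhba2 -C 0 -T 0 -L 4
  esxcli corestorage claimrule run

  # After the LUN has been unpresented on the array and the SAN rescanned,
  # remove the masking rule and unclaim any leftover paths
  esxcli corestorage claimrule delete -r 200
  esxcli corestorage claimrule load
  esxcli corestorage claiming unclaim -t location -A vmhba2 -C 0 -T 0 -L 4
  esxcli corestorage claimrule run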

Now this is a rather complex set of instructions to follow. Fortunately, we have made things a little easier with 5.0.

ESXi 5.0

The first thing to mention in 5.0 is that we have introduced a new Permanent Device Loss (PDL) condition, which can help alleviate some of the conditions that previously caused APD. But you could still run into APD if you don't correctly remove a LUN from the ESX host. This post covers the enhancements made in the UI and the CLI to make the removal of a LUN easier, but there are KB articles that go into even greater detail.

To avoid the rather complex set of instructions that you needed to follow in 4.1, VMware introduced new detach and unmount operations to the vSphere UI & the CLI.

As per KB 2004605, to avoid an APD condition in 5.0, all you need to do now is detach the device from the ESX host. This will automatically unmount the VMFS volume first. If there are objects still using the datastore, you will be informed. You no longer have to mess about creating and deleting rules in the PSA to do this safely. The steps now are (a short CLI sketch follows the list):

  1. Unregister all objects from the datastore including VMs and Templates
  2. Ensure that no 3rd party tools are accessing the datastore
  3. Ensure that no vSphere features, such as Storage I/O Control or Storage DRS, are using the device
  4. Detach the device from the ESX host; this will also initiate an unmount operation
  5. Physically unpresent the LUN from the ESX host using the appropriate array tools
  6. Rescan the SAN
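
For the CLI route, a minimal sketch of the unmount, detach and rescan steps with the 5.0 esxcli namespace follows; the datastore name and NAA id are placeholders, and KB 2004605 remains the authoritative reference:

  # Unmount the VMFS volume (the UI detach operation does this for you)
  esxcli storage filesystem list
  esxcli storage filesystem unmount -l MyDatastore
  # Detach the device by setting its state to off
  esxcli storage core device set -d naa.6005nnnnnnnn0019 --state=off
  esxcli storage core device detached list
  # Once the LUN has been unpresented on the array, rescan the SAN
  esxcli storage core adapter rescan --all
  # Optionally remove the stale entry from the detached device list
  esxcli storage core device detached remove -d naa.6005nnnnnnnn0019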

This KB article is very good since it also tells you which features (Storage DRS, Storage I/O Control, etc.) may prevent a successful unmount and detach.

Please pay particular attention to these KB articles if/when you need to unpresent a LUN from an ESX host.

Get notified of these blog postings and more VMware Storage information by following me on Twitter: @VMwareStorage

28 thoughts on “Best Practice: How to correctly remove a LUN from an ESX host”

  1. Oscar Madrigal

    ### Masking LUN paths using PowerCLI (Tested on ESXi 4.1)
    Connect-VIServer -Server MyESX -Protocol https -User root -Password password!!
    $esxcli = Get-EsxCli
    # List all paths, showing the device id and runtime name for each
    $esxcli.nmp.path.list() | Select Device, RuntimeName | Sort RuntimeName
    # e.g. naa.6005nnnnnnnn0019 vmhba34:C0:T0:L0
    $esxcli.nmp.path.list("naa.6005nnnnnnnn0019")
    $esxcli.corestorage.claimrule.list()
    # Starting rule number and the adapter/channels/targets/LUNs to mask
    $Prule = 157
    $Padapter = "vmhba34"
    $Pchannels = 0,1
    $Ptargets = 0,1
    $Pluns = 0,1,2
    ForEach ($Plun in $Pluns) {
        ForEach ($Pchannel in $Pchannels) {
            ForEach ($Ptarget in $Ptargets) {
                Write-Host "Add Rule: " $Prule " for "$Padapter":C"$Pchannel":T"$Ptarget":L"$Plun
                # Add a MASK_PATH claim rule for this path, load the rules, unclaim the path, then run the rules
                $esxcli.corestorage.claimrule.add($Padapter,$null,$Pchannel,$null,$null,$null,$null,$Plun,$null,"MASK_PATH",$Prule,$Ptarget,$null,"location",$null)
                $esxcli.corestorage.claimrule.load()
                $esxcli.corestorage.claiming.unclaim($Padapter,$Pchannel,$null,$null,$null,$Plun,$null,$null,$null,$Ptarget,"location",$null)
                $esxcli.corestorage.claimrule.run()
                $Prule++
            }
        }
    }
    $esxcli.corestorage.claimrule.list()
    # --> Perform LUN masking at the storage array level
    ### Unmasking the paths
    $Prule = 157
    $Pluns = 0,1,2
    ForEach ($Plun in $Pluns) {
        ForEach ($Pchannel in $Pchannels) {
            ForEach ($Ptarget in $Ptargets) {
                Write-Host "Delete Rule: " $Prule " for "$Padapter":C"$Pchannel":T"$Ptarget":L"$Plun
                # Delete the masking rule, reload the rules, then unclaim the leftover path
                $esxcli.corestorage.claimrule.delete($null,$Prule)
                $esxcli.corestorage.claimrule.load()
                $esxcli.corestorage.claiming.unclaim($Padapter,$Pchannel,$null,$null,$null,$Plun,$null,$null,$null,$Ptarget,"location",$null)
                $Prule++
            }
        }
    }
    $esxcli.corestorage.claimrule.run()
    $esxcli.corestorage.claimrule.list()

  2. Techstarts

    Specific to ESXi 4.1, is the procedure the same when decommissioning storage? We are decommissioning an existing datastore and will Storage vMotion our VMs to new datastores. If you could please clarify.
    Also, I have not understood the last section of KB 1015084:
    "Run this command to unclaim each path to the masked device:
    # esxcli corestorage claiming unclaim -t location -A vmhba0 -C 0 -T 0 -L 2
    This ensures that all paths to the device are unclaimed successfully before running the claim. Update vmhba, controller, target and LUN number as required."
    Why do I have to unclaim the path when the device is never coming back?

  3. Chogan

    Thanks for commenting, Techstarts.
    Yes, I would use the same procedure to decommission storage.
    My understanding of why you unclaim the device is that it removes it completely from the VMkernel. This means that the VMkernel will no longer try to look for that device while you are in the process of removing/decommissioning it from the host.
    If the KB is unclear, please leave a comment in the KB and ask them to clarify. These comments are all monitored, and should result in more detail getting added to the KB.

  4. nate

    How does this relate to using RDM? I wrote back in 2009 about how annoying vSphere 4.1's new "intelligence" was with regards to removing LUNs (in my case RDMs) and re-adding them (I did this for SAN-based snapshot stuff).
    http://www.techopsguys.com/2009/08/18/its-not-a-bug-its-a-feature/
    I have worked around the "feature" by using iSCSI directly in the guests, bypassing vSphere entirely, but this is not an ideal solution when I have hardware-accelerated fibre channel available to every system.
    Basically my process was:
    1) unmount RDM from within guest
    2) remove LUN out from under ESX
    3) delete snapshot assigned to LUN
    4) create new snapshot
    5) create new LUN (same number as before)
    6) export to ESX
    7) re-mount volume in guest (i.e. no unmapping from guest, no guest reboot/shutdown etc)
    This was part of an automated process that would take a R/W snapshot from say a production Oracle or MySQL database and swing it around to another system for use.
    In ESX 3.0 and 3.5 this worked flawlessly hundreds of times; in vSphere 4.0 (haven't tried 4.1 or 5.0 yet) this broke because vSphere assigned a new UUID to the new snapshot even though it was on the same LUN # with RDM! The docs for RDM explicitly state RDM is good for leveraging SAN snapshots and/or for doing backups, both of which I leveraged this ability for, and it broke horribly in 4.0. vSphere should stay out of trying to micro-manage RDMs because they are just that: raw devices.
    I had similar UUID troubles (which I document in the above post) when dealing with NFS as well, since vSphere 4.0 (again, haven't tried 4.1 or 5.0 yet) assigned a UUID based on the IP address of the NFS cluster (of which there were many). So if I relied upon DNS to access my NFS (which seems common) I could not do things like vMotion, because host #1 may be using IP #1 and host #2 may be using IP #3, and vSphere would see those different IPs as different UUIDs (even though it is the same file system, same NAS cluster) and prevent vMotion from occurring.
    Can you revise your post to include instructions on how to do the same thing with RDM ? Preferably something that is simple.
    I filed support cases on both of these issues back in 2009 but got nowhere.
    Maybe it will ‘just work’ in 4.1 and/or 5.0, though I’m not holding my breath based on what support told me back then.
    thanks

  5. Chogan

    Hello Nate,
    Thanks for the comment.
    My best guess for this behaviour is that in vSphere 4.0 we moved away from using the archaic vmhbaC:T:L:P naming convention to using actual SCSI identifiers like the NAA id. The older method would allow you to remove a LUN and present a different LUN back, and have ESX 3.x and earlier think it was the same device. As you've now observed, this is no longer the case, with each snapshot getting allocated its own NAA id. This mechanism prevents original and snapshot copies of a VMFS being presented to the same ESX host at the same time and causing other problems.
    The behaviour in 4.0 will be the same in 4.1 & 5.0, as we now use NAA ids to uniquely identify SCSI devices.
    To work around this issue, could you not just remove the mapping file after unpresenting the old LUN/snapshot and recreate it when the new LUN/snapshot is presented? The mapping file will then have the correct reference (older mapping files used the vmhbaC:T:L:P naming, which is why it worked).
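    A rough sketch of that from the ESXi shell would look something like this (the NAA id, datastore and VM folder names are placeholders; use -z for physical compatibility RDMs or -r for virtual compatibility):
    # Remove the stale mapping file, then recreate it against the newly presented LUN
    vmkfstools -U /vmfs/volumes/MyDatastore/MyVM/oracle_rdm.vmdk
    vmkfstools -z /vmfs/devices/disks/naa.6005nnnnnnnn0019 /vmfs/volumes/MyDatastore/MyVM/oracle_rdm.vmdk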

  6. Ravi

    In vSphere 5.0, it is worthwhile to note that prior to the unmount, vCenter does an exhaustive check, and only when everything checks out "Green" will it give you the option to proceed with the unmount.
    Big thanks to the team for making this an easier task rather than going through hoops (masking and so on, as in ESX 4.x).

  7. Ed Grigson

    One minor mistake – you can’t detach the device without first unmounting any datastores, whereas you’ve stated that a ‘detach’ will automatically unmount.
    Secondly, is there any way to detach a device from a group of hosts using the VI client? Typically I have a LUN presented to a whole cluster, and as far as I can see you'd have to repeat the detach for each host in the cluster or script it via the command line.

  8. Chogan

    Thanks for the comment Ed.
    During my testing, I observed an automatic unmount when I did a detach, but perhaps it doesn't work in all instances, as you observed.
    There is no way to do a device detach from a group of hosts via the vSphere client. We will have this in a future release. Right now, you have to do it on a per-host basis or script it. VMware has already created scripts for this; check out KB article 2011506.
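    For example, with the vCLI version of esxcli you could loop over the hosts from a management station; this is a rough sketch only, where the host names, credentials, datastore name and NAA id are all placeholders:
    for host in esx01 esx02 esx03; do
        esxcli --server $host --username root --password 'secret' storage filesystem unmount -l MyDatastore
        esxcli --server $host --username root --password 'secret' storage core device set -d naa.6005nnnnnnnn0019 --state=off
    done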
    Cormac

  9. Michael

    What is the main difference between the Delete and Unmount options for VMFS datastores? We are currently creating datastore clusters and migrating our VMFS3 datastores to new VMFS5 ones. The old ones are then put into maintenance mode, and we have been unmounting and detaching them, though this is tedious. Since we have no data on them and the LUNs will be destroyed on the array afterwards anyway, is it okay to just delete the datastore rather than go through the unmount and detach process? I haven't been able to find this answer in my searches.

  10. Keith Olsen

    What is the purpose of unregistering the VMs?

    I will be doing some maintenance work on one array; the hosts connect to multiple arrays.

    What are the potential problems if I just shut down the VMs, mask out the LUN, then perform my upgrades? After the upgrade is complete, remove the masking and rescan.

    Thanks!

    Keith

  11. Andreas

    Hi Chogan,
    Today I ran a disaster recovery test with an Oracle DB server that has virtually connected raw devices (for the database only).
    We use Oracle ASM replication for the database, so normally we can lose a LUN and the database service keeps running.

    At the moment we use ESX 5.1b, and when we lose a LUN the APD alert appears (Retry / Cancel the connection) and the VM stops working.

    Is this solution for APD only, and what problem will appear with a raw device? The VMDK link is still available!

    Thanks for Reply.

    Andreas

  12. Steve

    Hi Cormac – is this still the correct procedure with 5.1? I know that there were some APD/PDL improvements with 5.1 so I was wondering if detaching the LUN from the devices tab is still necessary.

  13. Arild Skullerud

    I have a possibly unforeseen problem.

    I’ve had an ESX 5.0 host at site A accessing 15 LUNs in my SAN there.

    Then I moved that ESX host to site B. Site B has 16 LUNs, but I only see the last LUN. I guess this is due to the ESX host 'holding' those first 15 LUN numbers in case they show up at a future date.

    I however would like to flush the list and have the LUNs show up like they normally do when I add a new ESX host.

    Any suggestions to do just that? I guess an unmount may not be the right choice?

  14. Amol Rane

    In spite of following all of the prerequisites per KB 2004605 (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2004605), it is still failing to unmount/delete the VMFS datastore from an ESXi 5.1 cluster. I continuously get the following error: "The vSphere HA agent on host 'esx01.ad.com' failed to quiesce file activity on datastore '/vmfs/volumes/4ea14076-9655b810-9180-001-0185f07ba'. To proceed with the operation to unmount or remove a datastore, ensure that the datastore is accessible, the host is reachable and its vSphere HA agent is running."

  15. Dave H

    Hi, we have ESXi 5.1 U1. We are dreading removing some old CLARiiON LUNs (about 100), and we have unmounted them so people cannot put new machines on them now that they are vacated.

    That said, to get through all these hosts quickly, can we just shut down the hosts one by one, unmask the LUNs from the host, then boot up? Skipping the detach would save us a lot of work... or is the system still at risk of APD?

    It has amazed me that SAN handling has been in this sorry state for so long; when trying to move dozens of LUNs, it's a LOT of work! The detach on all hosts cannot come soon enough!

  16. MarcT

    Cormac,

    If one was trying to decommission a storage array (take it entirely out of the picture), does one have to mask the representation of the storage controllers themselves? In our case, our HP MSA2312fc presents the controllers as LUN 0, with two targets from each controller. We have successfully removed all of the datastore LUNs (1-8 in this case).

