Monthly Archives: June 2010

Some random, but useful thoughts

Another guest post from Tech Support Engineer Mike Bean speaking casually about a few observations and a couple of tips.

Good morning lords and ladies, I know I missed last week, but what can I say, you can’t keep a good technical support engineer (or me!) down! First and foremost: I absolutely MUST call attention to something that crossed my Twitter feed a few days ago.

Great day in the morning, why is this not on the front page of the VMware support page in bold lettering? I’m sure you can probably imagine, this is a frequently requested thing. I don’t know if it was only recently automated, or if we just didn’t know it was there, but rest assured I’ll be talking to the building manager about having a 4 story banner sign hung on the side of the building overlooking US36 (in Colorado). (I’m kidding, we don’t actually need a 4 story sign, two stories will do just fine!)

On a side note, one of our employees fairly high in the knowledge base chain of command came to my humble cubicle this afternoon! VMware management has been nothing but supportive of my efforts, but it became clear fairly quickly that neither she nor I had a clear picture of what impact, if any, these blogs are having. I’ve said it before, I’ll say it again: I read my fair share of technology blogs and listen to quite a few podcasts, and generally, the best give the public what they want.

ESX or diet ESX

Since my arrival at VMware, I can say safely that a fair amount of time and energy has been put into the distinction between ESX, and ESXi, and I’ve had at least a few conversations with various customers to that effect. The unwelcome truth, is that it’s not an overly simple subject. Usually by the time a customer arrives in my queue, they have a few preconceptions, and it’s usually because their managers told them “X”, or our sales force told them “Y”; but it’s a subject worth exploring in greater detail. When you have a clear picture of how ESX is supposed to look, it’s generally a lot easier to fix when something’s wrong.

I alluded briefly to the service console in a previous blog. What, IS, the service console? We’ll let our product design engineers hash out a specific definition. For our purposes, it’s enough to think of the service console as a Linux virtual machine that exists to help you manage your host. Think of it like a maintenance hatch. Don’t confuse the service console with VMkernel, or vice versa. If you need to call Global Support Services, it’s almost always a good idea to have some form of root access to the service console ready first! Learn how to enable root SSH login here.

Which more or less brings me to my first point, the primary difference between ESX, and ESXi, from an operations point of view, is that ESXi has no service console. It doesn’t take a specialist to see that poses certain security and footprint size advantages. The service console, is a potential point of weakness, and without it, ESXi can sometimes be considered inherently more secure. At least, that’s the typical argument for using ESXi over ESX. Why then, use ESX over ESXi? You may well have heard rumors by now, that VMware’s focus is on ESXi. To the best of my knowledge, those rumors are completely true. Why then, would anyone choose ESX?

Simple, the service console is a Linux derivative. Generally, anyone who’s familiar with Red Hat will probably feel right at home. That means that ESX (classic) is SUBSTANTIALLY easier to maintain. Commands and techniques that work on ESX don’t work on ESXi, and vice versa. Ultimately, when a customer asks me on the phone, “What should I use?”, I generally advise them to plan according to our commitments, not our conversations. For the time being, we are committed to, and supporting, both ESX and ESXi. So most of the time, a customer is choosing between ESXi’s enhanced security and smaller memory profile, and ESX’s ease of use.

Please be aware, some men and women far smarter than myself are working very hard to bring ESX’s ease of use to ESXi, and I have every confidence that they WILL succeed. In the meantime however, when customers choose ESXi, I usually try to suggest to them that they familiarize themselves with certain tools.

ESXi lives almost COMPLETELY in RAMdisk!

This is relevant because EVERYTHING inevitably enters an error state sooner or later. It’s the nature of software. If ESXi does enter an unrecoverable state that forces you to reboot, it loses the contents of RAM, and becomes VERY difficult to diagnose. For this reason, I ALWAYS recommend the use of an external syslog server! (Syslog is a standard logging protocol from the Unix world that ESXi can be configured to use.)

See our Knowledgebase article: Enabling syslog on ESXi

This isn’t the easiest thing in the world to configure, but when you do enter an error state that forces a reboot, you’ll have your system logs to fall back on, and that, is a substantial advantage! It only takes one outage incident for a syslog server to pay for itself in spades!
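To make the idea of an "external syslog server" a little more concrete, here is a minimal sketch of a syslog receiver in Python. It is not a substitute for a real syslog daemon and it is not how the Knowledgebase article tells you to set things up; it only illustrates what a receiver does with the datagrams a host sends. The port 5514 and the file name esxi-syslog.log are assumptions made for this example (standard syslog uses UDP port 514, which normally needs root privileges to bind), and the function names are hypothetical.

```python
import re
import socketserver

LISTEN_ADDR = ("0.0.0.0", 5514)  # assumption: unprivileged port; standard syslog is UDP 514
LOG_FILE = "esxi-syslog.log"     # assumption: hypothetical output file

PRI_RE = re.compile(rb"^<(\d{1,3})>")


def parse_priority(datagram: bytes):
    """Split the leading <PRI> field into (facility, severity).

    Returns None if the datagram has no valid priority prefix.
    """
    match = PRI_RE.match(datagram)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # facility 23, severity 7 is the largest legal value
        return None
    return divmod(pri, 8)  # PRI = facility * 8 + severity


class SyslogHandler(socketserver.BaseRequestHandler):
    """Append every received datagram to LOG_FILE, tagged with the sender."""

    def handle(self):
        data = self.request[0]  # for UDP servers, request is (bytes, socket)
        sender = self.client_address[0]
        text = data.decode("utf-8", errors="replace").rstrip()
        with open(LOG_FILE, "a", encoding="utf-8") as log:
            log.write(f"{sender} {parse_priority(data)} {text}\n")


def serve():
    """Run the receiver until interrupted."""
    with socketserver.UDPServer(LISTEN_ADDR, SyslogHandler) as server:
        server.serve_forever()
```

Calling serve() starts the listener. The point of the exercise is that the log lines land on a different machine, so they survive when the host that produced them reboots and loses its RAM disk.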

vMA

Familiarize yourself with the vMA (vSphere management assistant). Think of it as a portable service console for your ESX host. The vMA, and the CLI interface it contains, understands many of the same commands as the ESX service console. Use it to connect to your ESXi host, and you’ll regain a great many of the commands ESXi lost.

Check out: vSphere Management Assistant (vMA) [appliance for vSphere CLI, vSphere SDK for Perl, and SMI-S] 
-AND-
vSphere Command-Line Interface Installation and Reference Guide

Feel free to hate me while you’re learning it ;). I’ve been working in ESX administration & troubleshooting for years now, and even I still find the vMA/CLI a little challenging. That said, trust me, you might not like using it today or tomorrow, but there will come a day when you won’t know how you lived without it. Syslog and the vMA are essential tools, practically the Batman and Robin of ESXi administration, and your environment will almost certainly be better off for it!

That’s all for this week, to reiterate, in conclusion, don’t be shy about letting the knowledge base team know what you like! If we’re doing things right, let us know so we can keep on trucking! More importantly, there’s an awful lot of possible subjects out there, so if there are topics you feel are weak or just flat want to know more about, send us a shout-out/ping.

Live well!

NIC is missing in my Virtual Machine

Today we have a post about virtual networking from Ramprasad K.S., who is a senior tech support engineer in our Bangalore office.

Have you ever had a case where a virtual machine loses its configured NIC?

Background

In vSphere we introduced “Hot Add/Remove” for Network Adapters and SCSI controllers, along with CPU and Memory. This means you can now add or remove these devices while a VM is powered on and the guest is running. This capability is not limited to the management interface: these devices also show up as hot removable in the guest (in Windows, you use the “Safely Remove Hardware” icon in the system tray).

There are two reasons why a NIC can go missing from a virtual machine:
  • One reason is hot removal from the guest. With the new Hot Add/Remove feature, NICs show up under the “Safely Remove Hardware” list. Any user with administrative privileges can accidentally remove the NIC using this feature. This is a common reason why a NIC goes missing, and this misstep results in numerous calls into support.
  • The other reason is that someone manually removed it from the virtual machine configuration (probably using the UI or an SDK API).
How can we find out which one of the methods was used?

In both cases, the virtual machine logs provide clues as to which of these methods was used.

NIC removed from VM using UI (Edit Settings)

If the NIC is removed using the UI (“Edit Settings” for the virtual machine), you will see API calls logged in the vmware.log of the virtual machine. The log text will be similar to the following:

Mar 15 03:13:37.392: vmx| Vix: [466627 vmxCommands.c:1929]: VMAutomation_HotPlugBeginBatch. Requested by connection (1).
Mar 15 03:13:37.420: vmx| Vix: [466627 vmxCommands.c:1861]: VMAutomation_HotRemoveDevice
Mar 15 03:13:37.420: vmx| VMAutomation: Hot remove device. asyncCommand=3E10BA28, type=54, idx=1
Mar 15 03:13:37.420: vmx| Requesting hot-remove of ethernet1

The line immediately above indicates that the NIC removal was initiated by either an SDK API Call or UI and the following log segment indicates the Hot Removal completed.

Mar 15 03:13:37.463: vmx| Powering off Ethernet1
Mar 15 03:13:37.463: vmx| Hot removal done.

You may also observe the VM pause for a brief time to complete the removal.
Mar 15 03:13:37.447: vmx| Checkpoint_Unstun: vm stopped for 17696 us

NIC removed inside the Guest

In this case we will see slightly different log entries. There will be no indication of VMAutomation being involved. The start of the removal is identified by the following lines:

May 27 16:38:52.903: vcpu-0| CPT current = 0, requesting 1
May 27 16:38:52.903: vcpu-0| CONFIGDB: Logging Ethernet0.pciSlotNumber=-1

Completion of the hot removal can be identified by the same log messages as in the earlier case:

May 27 16:38:53.417: vmx| Powering off Ethernet0
May 27 16:38:53.418: vmx| Hot removal done.

Note:  NIC removal is always a user initiated process either outside of the Guest (using UI) or inside the guest. There are no other reasons why a NIC should go missing from Virtual machine configuration.
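The two log signatures above lend themselves to a quick automated check. The following is a rough sketch, assuming the excerpt passed in covers a single removal event and that the marker strings are exactly those shown in the log samples in this post; the function name and return values are made up for illustration.

```python
def classify_nic_removal(log_lines):
    """Guess how a NIC hot removal was initiated from vmware.log lines.

    Returns "ui_or_api", "guest", or "unknown".
    """
    saw_automation = False
    saw_slot_cleared = False
    saw_done = False
    for line in log_lines:
        if "VMAutomation_HotRemoveDevice" in line or "Requesting hot-remove of" in line:
            saw_automation = True   # removal requested through the UI or an SDK API
        elif "pciSlotNumber=-1" in line:
            saw_slot_cleared = True  # signature seen when the guest ejects the device
        elif "Hot removal done" in line:
            saw_done = True
    if not saw_done:
        return "unknown"             # no completed removal in this excerpt
    if saw_automation:
        return "ui_or_api"
    if saw_slot_cleared:
        return "guest"
    return "unknown"
```

Feed it the lines of a vmware.log (for example, the result of reading the file and calling splitlines()). Treat the answer as a hint to go and read the log yourself, not as a verdict.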

How to Stop the NIC from being removed
  • Hot Add/Remove has to be disabled at the level of each virtual machine. At this time we don’t have any global configuration at the ESX/vCenter level that applies to all virtual machines. The parameter that controls the hotplug nature of the devices is devices.hotplug. Please follow Knowledge Base article 1012225: Disabling the HotPlug capability in ESX 4.0 virtual machine

Note: Remember that disabling hotplug means you can neither add nor remove a device while the virtual machine is powered on.

  • For Guests running Windows operating systems, we can use a registry hack to hide the hot removable capabilities of the NIC. Be careful following this method as it uses potentially dangerous registry editing. Please backup your registry before proceeding with any edits.
    • Run regedit as Local System account. One way to do this is to run “at <current time + 1 min in 24 hr format> /interactive regedit.exe”, without the quotes. Something like “at 00:33 /interactive regedit.exe”
    • Now go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum, search for E1000
    • Set the Capabilities value in the key(s) found above to its current value minus 4.

For example, we have the key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\VEN_8086&DEV_100F&SUBSYS_075015AD&REV_01\4&47b7341&0&1088 with the Service value E1000. Capabilities is set to 6. After changing the value to 2, the E1000 NIC is immediately no longer listed in the Safely Remove Hardware list.
If the guests are part of a domain, you might be able to push these changes to the system registry of the guests.
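The "current value minus 4" arithmetic in the steps above amounts to clearing a single bit. The sketch below assumes that the 4 corresponds to the Windows CM_DEVCAP_REMOVABLE device capability flag (0x4); the function name is hypothetical, and the code only computes the new value. It does not touch the registry.

```python
CM_DEVCAP_REMOVABLE = 0x4  # assumption: the "4" in the steps above is this Windows capability bit


def hide_from_safely_remove(capabilities: int) -> int:
    """Return the Capabilities value with the removable bit cleared."""
    return capabilities & ~CM_DEVCAP_REMOVABLE
```

Clearing the bit rather than subtracting has one practical advantage: applying it to a value that has already been fixed (2 in the example above) leaves it unchanged, whereas subtracting 4 a second time would produce garbage.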

VMware Support Twitter tweaks

We like to think we’re a pretty progressive support organization at VMware, so it should come as no surprise that we have been engaged in various social networking activities for some time now. This very blog came out of our strategy to communicate more openly with our customers, and both our customers and VMware are better for it.

We have been using Twitter too in two different capacities. @vmwarekb was created to allow real time updates from support about new Knowledge Base articles as well as providing weekly digests, highlights and alerts.

@vmwarecares has been around for a while with a mandate to be on the constant lookout for customer care opportunities, and point people to the right resources for self-help when they were looking for technical information.

Over the next months you’ll see a shift in how these services are divided between the two. We’re going to move the techie stuff away from @vmwarecares and it will be a customer service only contact point. @vmwarekb on the other hand is going to become more interactive than it has been in the past. We hope this division of duties will be simpler for customers to understand and at the same time will allow us to expand service in both.

If you have any comments about our twittering or blogging activities, we’d love to hear them! Let us know in the comment area below what YOU think!

VMware Snapshots

Today we have another guest post from Tech Support Engineer Mike Bean speaking casually about snapshots, a commonly misunderstood piece of the VMware solution.

If you ever want to make your VMware support representative cringe, just tell him or her you’re calling about a snapshot problem. Snapshots are very high on the list of misunderstood features, and to complicate things, snapshot problems often result in data loss, and let’s be honest, data loss is never funny.

ESX anatomy 101

To understand how snapshots operate, it’s important to understand the composition of your average virtual machine. To be sure, various virtualization architectures exist, but VMware’s is fairly straightforward. Every virtual machine consists of two parts, a *.vmx, and a *.vmdk. You’ll fairly frequently see other components, but in the end, if you do not have a *.vmx, and a *.vmdk, you don’t have a virtual machine. As we dive a little deeper, the *.vmdk consists of two parts:

1) <File>.vmdk – This, in the jargon, is called the descriptor. It is, what it sounds like. This is the file that contains the characteristics of the disk, if it’s lost, it can be re-created.

2) <File>-flat.vmdk – This is the actual disk. This is the money file. It is the deal breaker. The buck very definitely stops here. If the data is damaged, do not pass go, do not collect $200, just restore from a backup.

So, ultimately, our metaphorical VM will look something like this:

[Figure: VMware Snapshots]

Next, let’s add some secret sauce, and start taking some snapshots. ESX creates another descriptor, and starts creating a “changes” or delta file. The “changes” file is a continuous record of the block level changes to the disk. This is an important concept. VMware snapshots, unlike SAN-based snapshots, ARE NOT COPIES. Most everywhere you look, a snapshot is a copy or an image. The typical assumption is that if something goes wrong with your disk or backup, you can revert to the image. In ESX, that dog won’t hunt.

As you continue to work, your changes are recorded in the delta file. If the original disk is hypothetically damaged, you CANNOT revert to the snapshot, because the snapshot is not an autonomous disk; and removing the changes will not repair the damage. (We can’t always know what caused the damage in the first place).

Let’s add some additional snapshots to the mix. Take an additional snapshot, and what you’re really doing is tracking the block level changes between the first snapshot and the VM’s current state. It doesn’t take a VMware Technical Support Engineer to see how this can get out of control very quickly. We call these structures “snapshot chains”.

[Figure: VMware Snapshots]

Take a look at our snapshot chain. Let’s, for argument’s sake, poke a hole in it, and damage one of the delta files. UX/LX administrators out there will probably remember their old textbooks that discuss the difference between absolute and relative paths. The “changes” files are relative paths, and because one of the “mile markers” is now, for want of a better term, damaged, ALL of the changes data below the damage is now suspect.

Generally speaking, have a problem with snapshot 3, and you’re fine, just revert to snapshot 2. If you have a problem with snapshot 2, snapshot 3 is now entirely unreliable, because the changes it records, no longer apply. Have a problem with snapshot 1, and snapshots 2 AND 3 are now suspect!
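The dependency rule in the last paragraph is simple enough to write down. This is only a toy model of a snapshot chain as an ordered list, with hypothetical names; it says nothing about how ESX actually stores or links the delta files.

```python
def suspect_snapshots(chain, damaged):
    """Return the damaged link plus every later link that depends on it.

    `chain` is ordered oldest first, for example
    ["snapshot 1", "snapshot 2", "snapshot 3"].
    """
    index = chain.index(damaged)  # raises ValueError if `damaged` is not in the chain
    return chain[index:]


def last_trustworthy(chain, damaged):
    """Return the newest link before the damage, or None if the first link is damaged."""
    index = chain.index(damaged)
    return chain[index - 1] if index > 0 else None
```

With a three-link chain, damage to "snapshot 3" leaves "snapshot 2" as the fallback, while damage to "snapshot 1" makes everything in the list suspect and leaves only the base disk underneath.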

I’m sure you can see how this could lead to some unhappy people having unpleasant conversations! To illustrate, an office co-worker of mine got a call once from a company trying to recover a corrupted/damaged base disk that had YEARS worth of snapshots. It didn’t end well.

By now it should be readily apparent why snapshots do not make good backups. More to the point, it’s just not good digital asset management. A good backup infrastructure has to be able to stand on its own two feet; a spare tire in the trunk won’t help you if the check-engine light in your car comes on. In that sense, I’d like to propose an alternative way of thinking about the subject.

Engineers/software nerds fairly commonly use a concept, for want of a better term, we’ll call it version control. The code exists in a main branch or trunk. Write a new feature, or code a new bug-fix, and check the new code into the “build”. If the new bug-fix doesn’t work out, back it out. Use the build prior to the fix, however, ultimately, if the new bug-fix DOES work out, that, in essence, BECOMES the new build.

Humbly, I suggest we emulate this kind of thinking. Use snapshots not to create backups for your VMs, but as a form of version control. Snapshots are intended for short term use only. Got an OS patch coming for a critical VM? Take a snapshot and wait a couple days, perhaps a week. Once you’re certain the patch is viable and won’t cause excessive disruption, remove the snapshot!

I spoke to a customer once who had set up something he called his “nag script”. It routinely checked for the presence of snapshots older than a given interval, and began emailing the VM’s custodians on a regular basis to remind them to remove it. SMART. If I’d had an ESX infrastructure of my own, I would’ve asked if he’d be willing to share the code for his “nag script”. I can’t even begin to describe how many calls to Support could be avoided completely with simple, faithful adherence to this principle.

Don’t misunderstand me, when used as version control, snapshots can be a powerful tool. My primary goal in writing this article is not to discourage snapshot use, but to encourage responsible snapshot use, and try to impart some sense of WHY it’s important. I’ve said it before in previous articles and I’ll say it again, ultimately, the only safe policy is one of shared information (informed consent).

I’ve spoken with numerous customers over the years who’ve viewed their support calls as an opportunity to learn/ask questions, and I’ve always tried to encourage that attitude. Sometimes they want to understand what I’m doing, sometimes they want to record the webex session, sometimes they just want to take notes, and we do our best to respond in kind! Ultimately, we’re all on the same side!
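I never saw that customer’s “nag script”, so what follows is not his code. It is a minimal sketch of the idea in Python, assuming you already have some inventory of snapshots with creation times and owner addresses. How you collect that inventory is deliberately left out, and every name here (SnapshotRecord, stale_snapshots, nag_message, the seven-day default) is an assumption for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SnapshotRecord:
    """One row of a hypothetical snapshot inventory."""
    vm_name: str
    snapshot_name: str
    created: datetime
    owner_email: str


def stale_snapshots(records, now, max_age=timedelta(days=7)):
    """Return the records older than max_age, oldest first."""
    stale = [r for r in records if now - r.created > max_age]
    return sorted(stale, key=lambda r: r.created)


def nag_message(record, now):
    """Build the reminder text for one stale snapshot."""
    age_days = (now - record.created).days
    return (
        f"To: {record.owner_email}\n"
        f"Snapshot '{record.snapshot_name}' on VM '{record.vm_name}' "
        f"is {age_days} days old. Please review and remove it."
    )
```

Run something like this on a schedule, hand each message to whatever mail system you already use, and keep nagging until the list comes back empty.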

Until next time!

Addendum: Special thanks to Lisa Bernhardt (GSS, Storage Team) for helping translate!

Scheduled Maintenance June 18

VMware will be performing a system upgrade to several VMware Web applications on June 18, 2010. Maintenance will begin at 6:00 p.m. Pacific Time and end June 18, 2010 at approximately 11:59 p.m. Pacific Time.

While this upgrade is in progress, you will be unable to:

  • Access or manage your VMware account
  • Submit support requests online 
  • Download, purchase or register VMware products
  • Manage VMware product licenses
  • Access VMware Communities

If you need to file a support request while the upgrade is in progress, call VMware Technical Support for assistance.

We appreciate your patience during this maintenance period. These system upgrades are part of our commitment to continued service improvements and will help VMware better serve your needs.

VMware ALERT: View customers using PCoIP are advised to NOT apply Update 2 to ESX 4.0 (yet)

Earlier today VMware became aware of an issue affecting users of VMware View after applying Update 2 to their ESX 4.0 hosts. The problem only affects PCoIP; RDP works normally. There is a discussion of the problem in the VMware Communities here.

While our IT Teams work to resolve the issue, the Knowledge Base Team has responded by creating an up-to-the-minute live document at: http://kb.vmware.com/kb/1022830 and using @vmwarecares and @vmwarekb Twitter accounts to alert customers.

This Knowledge Base article will be updated as new information becomes available. If you have been affected by this, please read the KB.

We apologize for any inconvenience this may have caused you. If you know how to spread the word to your friends and colleagues, please do so.

To Patch, Perchance to Upgrade

Today we have another guest post from Tech Support Engineer Mike Bean.

'Morning everyone, I must take a moment to say thank you to people, both internally and externally, who’ve expressed support for the first column I wrote. To quote a web comic I enjoy, “we must learn, lest we stagnate”, if readers have enjoyed it as well; then I take that as a compliment!

At the risk of digressing from the “most wanted theme”, I wish to approach a new subject today. By the time you read this, ESX vSphere update 2 will be publicly available.

http://downloads.vmware.com/d/info/datacenter_downloads/vmware_vsphere_4/4

I gladly extend congratulations to all our development and QA teams the world over. I can say with complete sincerity they’re my heroes, for it is on their backs that our product is built, one feature spec and bug report at a time. Congratulations lords and ladies, hug your significant others, have a beer, and enjoy the moment. You’re our warriors!

I began my morning with a soda and a copy of the release notes, and I can safely say, it’s not light reading. Speaking as an army-ant in Global Support Services, I don’t see any issues we’ve been breathless with anticipation for, but there’s also far too many things being addressed to engage in any sweeping generalization.

http://www.vmware.com/support/vsphere4/doc/vsp_esx40_u2_rel_notes.html

In the course of my time here, I’ve often been asked the eternal question “should I get the patch?” I’m never quite sure how to answer this question, but it’s honestly one worth asking. So it’s worth taking a moment to examine.

Asking a software company if you should apply the patch is a little like asking a lawyer if you should sue; let’s face it, we have a slight bias. On one hand, if we didn’t think our customers would benefit, we wouldn’t have released the patch in the first place. On the other hand, many of my customers are system admins, and I’ve walked a mile in their shoes. In that sense, I’m well aware that they don’t have the liberty of applying patches/FW flashes whenever any one of the numerous vendors they do business with releases its latest update. My college economics classes would’ve called it “opportunity cost”. Downtimes must be scheduled, approvals must be obtained, benefits assessed. It is precisely because I have experienced both the software development point of view, and the system administration point of view, that I’m well aware that many of our customers may not have had problems. Risk is relative; my co-workers and I routinely speak to customers who’ve literally run for years without issues. Why then, upgrade?

Typically, when a customer asks me that question on the phone, I often end up trying to explain that I can’t really answer that question. It’s the natural course of software development that a changing code base means a changing landscape. Old problems are solved, and new ones arise, and I won’t imply to the contrary. However, that should not be interpreted as carte blanche to never patch. Risk may be relative, but as available security and stability fixes accumulate, so does risk, and so does benefit.

It’s a matter of risk assessment. As a GSS TSE (technical support engineer) it’s my responsibility to try and help present the facts and the options, but ultimately, the final decision always belongs to the customer. Only they know their networks, and basically, only they can realistically decide when the benefits of patching will exceed the costs. Inevitably, the only safe policy is one of shared information (informed consent). In that spirit, I encourage most everyone I speak with to familiarize themselves with the available resources, both in documents and communities. Examine the facts for yourself, and let the update speak for itself.

In closing, I would briefly highlight at least one real case from memory. I spoke with a customer some weeks ago who’d been having substantial issues with hangs on his cluster of Dell 2900s. I sincerely hope he’s watching update 2’s contents, very carefully!

http://www.vmware.com/support/vsphere4/doc/vsp_esx40_u2_rel_notes.html#resolvedissues

VMware ESX might fail to boot on Dell 2900 servers
If your Dell 2900 server has a version of BIOS earlier than 2.1.1, ESX VMkernel might stop responding while booting. This is due to a bug in the Dell BIOS, which is fixed in BIOS version 2.1.1.

Workaround: Upgrade BIOS to the version 2.1.1 or later.

You spoke, and we were listening! Till next time!

Important Knowledgebase article regarding vSphere Client and latest .Net patch from Microsoft

Users have been reporting problems opening the vSphere Client after applying patches from Microsoft on their desktops.

The error presented when trying to open the client is:

Error parsing the server "<servername>" "clients.xml" file.
The type initializer for ‘VirtualInfrastructure.Utils.HttpWebRequestProxy’ threw an exception.

VMware has responded with the following Knowledgebase article:

vSphere Client does not open on any Windows operating systems with the error: parsing the server "<servername>" "clients.xml" file (1022611)

We sent out the following alert via the VMware Support Toolbar which takes you to the KB article as well as keeping our friends on Twitter updated.

 

vSphere Clients prior to the Update 1 release cannot be used to access the vCenter Server or ESX hosts. A Microsoft update that targets the .NET Framework, released on June 9th, 2010, is causing this issue.

Help, my hosts/VM’s are disconnected

Today we have a post from guest blogger Mike Bean, who is a support engineer at VMware in the Broomfield, Colorado office. Mike plans to provide us a series he terms “Most Wanted”.

Good morning VMware aficionados. As a VMware TSE (technical support engineer), I consider my responsibility to be to help my customers manage their networks. It’s rather ironic; if I’m doing my job properly, fewer people need to call us!

I’d like to discuss our “Most Wanted” wall (I’m not speaking literally; we don’t actually have a wall with pictures of various glitches and posted reward money!). I intend to share some of our most common issues, with the hope that these articles may help you avoid some common mistakes. Most customers who send me service requests, are not advanced ESX users. They typically have advanced backgrounds in Windows/UX/LX, and usually would like to learn more about ESX, but haven’t had the time or the opportunity. Accordingly, I will assume a level of core competency/troubleshooting skills, and little to no ESX background. We know well the semi-haunted look of the harried sysadmin who unwillingly had ESX dropped into his or her lap with little preparation. On to today’s topic-

ESX anatomy 101:

One fairly common mistake is to assume the IP address of your host is, in fact, the IP address of ESX. Check any of the common VCP study guides, and they’ll all emphasize the point, ESX and the “service console” are not the same thing.

ESX consists of two discrete entities, the “service console”, and for want of a better term, VMkernel (ESX). The IP address of your host is, in fact, the service console’s. Think of it like a Linux virtual machine with a specific purpose. From the end user’s/administrator’s point of view, the service console exists to help you manage your host (a maintenance hatch). ESXi, on the other hand, does not have a service console, but we’ll save that rabbit hole for another time. The important takeaway point here is that many ESX problems actually stem from issues on the service console.

Sample Scenario

“Help, my hosts/VM’s are disconnected!”

Borrowing a line from the venerable Douglas Adams, don’t panic. Disconnected means exactly that. Disconnected does not mean OFF. ESX communicates with Virtual Center through what is loosely described as “management agents”. When you hear TSEs talk about “management agents”, we’re really talking about 3 things: vpxa, hostd, and vpxd.

Vpxa lives on the ESX host (on the service console) and communicates with vpxd. It’s mostly a listener service, and is very rarely an issue. Hostd also lives on the ESX host (on the service console). This is the lion’s share: the vast majority of “disconnects” indicate a problem with hostd. Lastly, vpxd lives on your Virtual Center server. These 3 services form a communications chain, and failure of one or more of them tends to produce “disconnects”.

The good news is, because the service console and VMkernel are separate, and VMkernel does the real work within ESX, problems and changes can and do occur on the service console without affecting your virtual machines. This is not to say the management agents will never affect VMkernel; what I am suggesting is trust, but verify. Ping your hosts, ping your VMs. Remote connect to them both. Try plugging your host address directly into your vSphere Client. Success means the problem is probably on your Virtual Center server (vpxd). Failure means the problem is likely on the host (hostd, vpxa).
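The “trust, but verify” routine can be sketched in a few lines of Python. This is a rough illustration only: it assumes the host accepts client connections on TCP port 443, and a successful TCP connection is a much weaker test than actually logging in with the vSphere Client. The function names are hypothetical.

```python
import socket


def can_reach(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    Port 443 is an assumption about where the host listens for clients.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def likely_culprit(direct_host_connection_ok):
    """Apply the rule of thumb from the text above."""
    if direct_host_connection_ok:
        return "Virtual Center side (vpxd)"
    return "host side (hostd, vpxa)"
```

For example, likely_culprit(can_reach("esx01.example.com")) points you at one end of the chain or the other; the host name is a placeholder. Either way, it only narrows down where to look first.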

To troubleshoot in greater detail, it is important to understand a few things. If the GUI is “disconnected”, and we can’t issue commands to ESX through the client, we need to do it another way. Enter the Service Console.

Enabling root SSH login on an ESX host (8375637)

Security admins the world over cringe at that article. As well they should – Root access is not for the timid. So it’s important to consult your organization’s security policies before permanently leaving root enabled on a service console. They can and do frequently require use of separate accounts, which can then switch-user, or “su” to root. Root is necessary however, to restart the management agents.

Restarting the Management agents on an ESX or ESXi Server (1003490)

Restarting the management agents isn’t a catch all, but if they’ve failed or stopped, it’s a good step in the right direction.

I like to try and illustrate things with actual case examples. Here is a real-world example of employing this knowledge-

Company A had a planned outage, and discovered to their chagrin that the ESX cluster wouldn’t connect to Virtual Center! We could enter the IP address of individual hosts into the vSphere Client, and connect to each host just fine. That meant that hostd was running, but could not connect with vpxd on Virtual Center. Long story short, both their primary DNS servers were virtualized, and some investigation revealed they were down. The management agents are fairly DNS sensitive. Manually starting their DNS servers, and restarting the management agents on the host, allowed them to reconnect to Virtual Center, and power up the rest of the hosts. Case Solved!
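Since the management agents are DNS sensitive, a quick name-resolution check is a cheap thing to run before digging deeper. Here is a small sketch; the host names in the usage note are placeholders, and the resolve parameter exists only so the function can be exercised without a live DNS server.

```python
import socket


def unresolvable(names, resolve=socket.gethostbyname):
    """Return the names that fail to resolve, in the order given."""
    failed = []
    for name in names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed
```

Calling unresolvable(["vcenter.example.com", "esx01.example.com"]) from the machine that is having trouble returns whichever names it cannot look up. An empty list does not prove DNS is healthy, but a non-empty one tells you where to start.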

Top 20 Articles for May 2010

Here is our Top 20 KB list for the past month. The list is ranked by the number of times a Technical Support case was resolved by following the steps in a published Knowledgebase article.