Home > Blogs > VMware TAM Blog > Monthly Archives: May 2012

Regular Maintenance for your Virtual Environment

You probably know the scenario well. An issue occurs in your environment, you log a support call, and between your engineers and the vendor the issue is solved, but not without significant cost to the organisation. These costs can be measured and include time spent on the issue, resources committed, the duration and cost of the service disruption and of course many other factors. All of these usually pale in comparison to the loss of confidence in the environment among users and management, and often the knee-jerk reaction is to look for an alternative or for ways to create a more redundant environment.

The first choice, looking for an alternative, is usually met with much skepticism by the operations team, who would need to learn an entirely new system, and of course management often baulks at the cost in any case. The second option, looking for more redundancy, usually gets a lot more traction in the organisation, and many ideas are put forward for what is now seen as the solution to one particular problem.

This article does not discuss either of these. Both are of course worth considering when a failure occurs; however, it is often the softer, simpler option that is overlooked: ongoing maintenance.

As a Technical Account Manager it is my job to support my customers when they have an issue and to ensure that it is resolved as quickly as possible. The best way to do this is to prevent the issue from occurring in the first place. This is my goal with all of my customers, and a regimented process of scheduled maintenance is paramount to making it happen. While I cannot prevent every issue, I can try to ensure that issues that have already been identified are fixed before they cause a problem.

If you have read my blog postings before you will know I am a real fan of a simple step-by-step guide to doing things. That way we at least have a starting point, so here is mine.

Step 1: Regular scheduled meeting (in person or by conference call) with the main operations team members. I do not advocate including every business system owner in this, as their input should not be required to prevent issues.

Step 2: Provide a Scheduled Maintenance Plan Window: This is basically a plan, open to all, that provides a regular window in which operations can maintain the system. While we all strive for four or five nines of availability, we need to make sure our business system users are aware of the importance of maintaining their systems, which may require some downtime. Of course there are ways to minimize this, which are beyond the scope of this blog post.
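
To put the "nines" in perspective, here is a minimal sketch of the arithmetic behind an availability target. It assumes a 365-day year, and the function name is my own, purely for illustration. Whether planned maintenance counts against the target depends on how your organisation defines availability, so treat the numbers as a conversation starter rather than a rule.

```python
# Yearly downtime budget implied by an availability target of N nines.
# Assumption: a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(nines: int) -> float:
    """Return the allowed downtime per year, in minutes, for N nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability


for n in (4, 5):
    print(f"{n} nines: about {downtime_budget_minutes(n):.1f} minutes per year")
```

Four nines leaves roughly 52.6 minutes a year and five nines roughly 5.3 minutes, which is why an agreed maintenance window needs to be communicated clearly to the business.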

Step 3: Daily/Weekly/Monthly preventative maintenance plan: This is a simple list of tasks. The list needs to state the task, who owns it (group), when it is done, and how to do it. I always recommend both automated and manual steps for every task where possible, and most important of all is a way to check that the task was completed successfully.
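
As a rough illustration, the task list described above could be kept in a form as simple as the sketch below. Everything here is hypothetical: the field names, the example tasks, the owner groups and the placeholder check are mine, and a spreadsheet would do the job just as well. The point is that each task records what, who, when and how, plus a way to confirm it was done.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class MaintenanceTask:
    name: str          # what the task is
    owner_group: str   # who owns it
    frequency: str     # when: "daily", "weekly" or "monthly"
    how_to: str        # how to do it (or where the procedure lives)
    verify: Optional[Callable[[], bool]] = None  # automated check, if any


def run_checks(tasks: List[MaintenanceTask], frequency: str) -> Dict[str, str]:
    """Report the status of every task scheduled at the given frequency."""
    results = {}
    for task in tasks:
        if task.frequency != frequency:
            continue
        if task.verify is None:
            # No automated check exists, so a person must confirm it.
            results[task.name] = "manual check required"
        else:
            results[task.name] = "ok" if task.verify() else "FAILED"
    return results


# Hypothetical example entries; replace with your own tasks and checks.
example_tasks = [
    MaintenanceTask(
        name="Confirm overnight backups completed",
        owner_group="Backup team",
        frequency="daily",
        how_to="Review the backup job report",
        verify=lambda: True,  # placeholder for a real automated check
    ),
    MaintenanceTask(
        name="Review host hardware alerts",
        owner_group="Virtualisation team",
        frequency="weekly",
        how_to="Walk through open alerts in the management console",
    ),
]

print(run_checks(example_tasks, "daily"))
print(run_checks(example_tasks, "weekly"))
```

Tasks without an automated check are still reported, so nothing silently drops off the list.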

Step 4: Check the vendor's knowledge base and patch updates: As part of the ongoing weekly meetings that you schedule and the preventative maintenance plan that you have prepared, one of your tasks should be to check the vendor's official knowledge base or patch update site. It is important to evaluate these articles and patches in terms of their potential impact if they are applicable. I recommend a simple check of the items, followed by an evaluation of the impact on your environment of either doing or not doing the task.
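
The knowledge base check above can be recorded in a very lightweight way. The sketch below is one possible approach, not a VMware tool or process: the 0 to 3 risk scale, the field names, the identifiers and the wording of the recommendations are all assumptions of mine. It simply captures whether an item applies and compares the judged impact of acting against the impact of not acting.

```python
from dataclasses import dataclass


@dataclass
class AdvisoryReview:
    reference: str         # KB article or patch identifier (placeholders below)
    applicable: bool       # does it apply to this environment?
    risk_if_applied: int   # 0 (none) to 3 (high), as judged by the team
    risk_if_skipped: int   # same scale, for leaving things as they are


def recommend(review: AdvisoryReview) -> str:
    """Suggest a next step for one reviewed article or patch."""
    if not review.applicable:
        return "no action"
    if review.risk_if_skipped > review.risk_if_applied:
        return "schedule for next maintenance window"
    return "defer and revisit at next meeting"


# Placeholder entries for illustration only.
reviews = [
    AdvisoryReview("KB-EXAMPLE-1", applicable=True, risk_if_applied=1, risk_if_skipped=3),
    AdvisoryReview("KB-EXAMPLE-2", applicable=False, risk_if_applied=0, risk_if_skipped=0),
]

for r in reviews:
    print(r.reference, "->", recommend(r))
```

The scores are a judgment call made in the meeting; the value is in writing the decision down so it can be revisited.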

Like all good plans, doing this at a scheduled time, with the correct people, on an ongoing basis is the best way to prevent as many potential issues as you can in your environment. I would expect that many of the issues we have to solve when they occur are clearly documented in KB articles or patch updates that could have been applied before the problem even occurred.

Achieving the dream of high uptime requires careful planning, but forgetting the basics of scheduled preventative maintenance can unravel the best technology systems and leave you with an unplanned outage, which is the worst place to be.