
Category Archives: TAM-APJ

vForum Sydney Presentation – Ensure your VMware environment is stable and available 100% of the time

A few weeks ago I was privileged to present a session at our annual vForum event in Sydney, Australia. The event was very well attended, and the many customers I got to chat to afterward found the session valuable. I have attached the slides from the session to this posting if you would like to view them. If you require clarification on any aspect of the session, please feel free to email me.

Cheers

Neil Isserow – VMware Senior Technical Account Manager (Brisbane, Australia)

vForum 2013 Presentation – Ensure your VMware environment is stable and available 100% of the time

TAM Day Brisbane 2012 Wrap Up

Each year in Brisbane, Australia we host an event for our TAM customers to present the newest VMware solutions and to thank them for being part of the TAM program. This year we had another fantastic event in Brisbane with all of our customers represented. The participants were kept well fed and hydrated with coffee, tea and cold drinks throughout the day, as well as a fantastic lunch and chocolates and sweets to keep the sugar levels up.

When the event was over we asked our customers to rate it, and we received the highest rating in the four years we have run this event: 4.69 out of 5, which we are very proud of. We also gave away a bunch of prizes, including a seat on a VMware course, an iPad, a VMware T-shirt and a movie gift voucher, all of which were enthusiastically received.

I look forward to the event again next year. Thanks to all the presenters, our sponsors for the venue and of course our TAM customers; without you, well, I wouldn't have a job :)

Neil Isserow – TAM (Brisbane, Queensland, Australia)

Event Pictures: https://www.icloud.com/journal/#4;CAEQARoQMWh_KLMbpOBuPyIquTfrnw;53AB1285-0C9F-4B30-8A3A-28E786CA8D22

Regular Maintenance for your Virtual Environment

You probably know the scenario well. An issue occurs in your environment, you log a support call, and between your engineers and the vendor the issue is solved, but not without significant cost to the organisation. These costs can be measured and include time spent on the issue, resources committed, the duration and cost of the service disruption, and of course many other factors. All of these usually pale in comparison to the loss of confidence in the environment among users and management, and often the knee-jerk reaction is to look for an alternative or for ways to create a more redundant environment.

The first choice, looking for an alternative, is usually met with much skepticism by the operations team, who would need to learn an entirely new system, and management often baulks at the cost in any case. The second option, adding more redundancy, usually gets a lot more traction in the organisation, and many ideas are identified to help address the particular problem that triggered the search.

This article discusses neither of these. Both are worth considering when a failure occurs, but it is often the softer, simpler option that is overlooked: ongoing maintenance.

As a Technical Account Manager it is my job to support my customers and, when they have an issue, to ensure that it is resolved as quickly as possible. The best way to do this is to prevent the issue from occurring in the first place. This is my goal with all of my customers, and a regimented process of scheduled maintenance is paramount to making it happen. While I cannot prevent every issue, I can try to ensure that issues that have already been identified are fixed before they cause a problem.

If you have read my blog postings before you will know I am a real fan of a simple step-by-step guide to doing things; that way we at least have a starting point. So here is mine.

Step 1: Hold a regular scheduled meeting (in person or by conference call) with the main operations team members. I do not advocate including every business system owner in this, as it is not their input that is required to prevent issues.

Step 2: Provide a scheduled maintenance window: this is a plan, open to all, that provides a regular window in which operations can maintain the system. While we all strive for 4 or 5 9's of availability, we need to make sure our business system users are aware of the importance of maintaining their system, which may require some downtime. There are of course ways to minimize this downtime, but they are beyond the scope of this blog post.
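
To make the "4 or 5 9's" target concrete for those conversations, here is a quick Python sketch (my own illustration, not part of any VMware guidance) that converts an availability target into a yearly downtime budget:

```python
# Downtime budget implied by an availability target, per year.
# Useful for framing how small the maintenance window really is.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for label, availability in [("three 9's", 0.999),
                            ("four 9's", 0.9999),
                            ("five 9's", 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label} ({availability:.3%} uptime): "
          f"{downtime:.1f} minutes of downtime per year")
```

Four 9's allows roughly 53 minutes of downtime a year, and five 9's just over 5, which is exactly why the maintenance window needs to be negotiated up front rather than assumed.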

Step 3: Daily/weekly/monthly preventative maintenance plan: this is a simple list of tasks. The list needs to record each task, who owns it (a group rather than an individual), when it is done, and how to do it. I always recommend documenting both automated and manual steps for every task where possible, and most important of all is a way to check that the task was completed successfully.
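
As an illustration, here is a minimal sketch of how such a list might be captured in Python; the task names, owners and verification steps are hypothetical examples, not VMware recommendations:

```python
from dataclasses import dataclass

@dataclass
class MaintenanceTask:
    name: str       # what the task is
    owner: str      # owning group, not an individual
    frequency: str  # daily, weekly or monthly
    how: str        # automated job or manual runbook step
    verify: str     # how to confirm the task completed successfully

# Hypothetical example entries; adapt to your own environment.
plan = [
    MaintenanceTask(
        name="Review host hardware alarms",
        owner="Virtualisation Ops",
        frequency="daily",
        how="Automated: monitoring job emails a summary",
        verify="Summary email received and all alarms acknowledged",
    ),
    MaintenanceTask(
        name="Check datastore free space",
        owner="Storage Team",
        frequency="weekly",
        how="Manual: runbook step 4.2",
        verify="All datastores above the agreed free-space threshold",
    ),
]

for freq in ("daily", "weekly", "monthly"):
    for task in (t for t in plan if t.frequency == freq):
        print(f"[{task.frequency:>7}] {task.name} -- owner: {task.owner}")
```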

Step 4: Check the vendor's knowledge base and patch updates: as part of the ongoing weekly meetings you schedule and the preventative maintenance plan you have prepared, one of your tasks should be to check the vendor's official knowledge base and patch update site. It is important to evaluate these articles and patches in terms of their potential impact if they are applicable. I recommend a simple check of the items, followed by an evaluation of the impact on your environment of either doing or not doing the task.
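
One lightweight way to keep that evaluation honest is to record every item reviewed and the decision taken. A minimal sketch; the fields and the example entry are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ReviewItem:
    reference: str    # KB article or patch identifier
    applicable: bool  # does it apply to our versions/configuration?
    impact: str       # consequence of applying vs. not applying
    decision: str     # apply, defer, or not applicable
    review_date: str

# Hypothetical entry recorded during a weekly review meeting.
log = [
    ReviewItem(
        reference="Example KB 0000000",  # placeholder, not a real article
        applicable=True,
        impact="Fixes a storage path failover issue we are exposed to",
        decision="apply in next maintenance window",
        review_date="2013-06-14",
    ),
]

for item in log:
    print(f"{item.review_date} {item.reference}: {item.decision}")
```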

Like all good plans, doing this at a scheduled time, with the correct people, on an ongoing basis is the best way to prevent as many potential issues as you can in your environment. I would expect that many of the issues we have to solve when they occur are already clearly documented in KB articles or patch updates that could have been applied before the problem even occurred.

Achieving the dream of high uptime requires careful planning, but forgetting the basics of scheduled preventative maintenance can unravel the best technology systems and leave you with an unplanned outage, which is the worst place to be.

Managing Performance when migrating to a Virtual Infrastructure

It is one of the questions I deal with most frequently in my duties as a VMware Technical Account Manager: how do I guarantee performance for my current physical system that is being virtualized? On the face of it there seem to be numerous answers and approaches that could achieve this goal, but before we can choose one we need to understand exactly what the question is.

This seems to be the fundamental issue: the question is often not well understood and is either made too complex or trivialised. When it is made too complex, no approach will ever satisfy it; when it is trivialised, we run the risk of not satisfying the requirements of the task.

So where do we start? There are many excellent web resources on the technical aspects of this, but I am going to focus on the logical steps I would take to achieve the desired outcome.

Step 1: Understand the current environment the system resides in. This should include everything from the physical location and connectivity to any political issues, such as an unwillingness to virtualize in the past. Leave no stone unturned in understanding what you are dealing with, no matter how simple it seems.

Step 2: Interview the system owner, business owner and anyone else with a stake in the system, including users if possible, and understand all of their requirements in terms of performance expectations. At this stage I do not concern myself with existing performance assumptions made by any of the people I have spoken to, as many of these are based on their pre-existing notion of the system and its operation, be it good or bad.

Step 3: Baseline existing performance. This is the most important step; without it there is really no way to understand what we are required to deliver. I disregard any preconceived ideas of performance from my earlier interviews and focus only on the existing service and what it is doing in terms of performance. This is a fairly lengthy stage and has a number of sub-components, as follows:

- Identify the physical system's performance characteristics in terms of internal system measures, including CPU, memory, disk, network and any other relevant items you are able to collect. I recommend collecting this information for at least four weeks, or longer, to get a good picture of these performance variables across many different load periods (see the sampling sketch after this list).

- Identify the performance of the system as users will perceive it. This is the harder part, as it requires you to find the best way to measure the performance a user will experience. There are several approaches here, but whatever you use, make sure it is repeatable and consistent so the results are not skewed. You may want to use a synthetic transaction test system or possibly a script-based timed test (see the second sketch below). These tests need to be run over a good period of time, during peaks and troughs in the environment, to again ensure you are not missing any potential peak load that could cause issues if not catered for.
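
For the internal system measures, here is a minimal in-guest sampling sketch in Python using the third-party psutil library. In practice you would lean on your platform's own monitoring tools and collect for weeks rather than minutes; this just illustrates the shape of the data worth keeping:

```python
# A minimal in-guest sampling loop using the third-party psutil library
# (pip install psutil). It writes one row of CPU, memory, disk and
# network counters per interval to a CSV file. Run it across different
# load periods; stop it with Ctrl-C.
import csv
import time
from datetime import datetime

import psutil

INTERVAL_SECONDS = 60  # one sample per minute

with open("baseline_metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "cpu_pct", "mem_pct",
                     "disk_read_bytes", "disk_write_bytes",
                     "net_sent_bytes", "net_recv_bytes"])
    while True:
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()
        writer.writerow([
            datetime.now().isoformat(timespec="seconds"),
            psutil.cpu_percent(interval=None),  # CPU since last call
            psutil.virtual_memory().percent,
            disk.read_bytes, disk.write_bytes,
            net.bytes_sent, net.bytes_recv,
        ])
        f.flush()
        time.sleep(INTERVAL_SECONDS)
```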
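
For the user-perceived side, a simple script-based timed test might look like the following; the URL is a hypothetical placeholder for whatever transaction best represents your system:

```python
# Repeatedly fetch a URL and record how long each request takes.
# Run it at the same times each day, across peaks and troughs, so
# the results stay comparable between environments.
import statistics
import time
import urllib.request

URL = "http://app.example.internal/login"  # hypothetical endpoint
SAMPLES = 50

timings = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    with urllib.request.urlopen(URL, timeout=30) as response:
        response.read()
    timings.append(time.perf_counter() - start)
    time.sleep(1)  # pace the requests

timings.sort()
print(f"median: {statistics.median(timings):.3f}s")
print(f"95th percentile: {timings[int(len(timings) * 0.95)]:.3f}s")
```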

Step 4: Document all of the baseline results and ensure they are agreed upon with the business owner of the system, as these will be used later on to judge how we are doing.

Step 5: Identify the system requirements, based on the documented performance criteria, for migrating the system to the new virtual environment. This might be a bit tricky, as you are still going to perform baseline tests in the new environment, so this will be just an initial take on what is required. Whether you set the requirements too high or too low here should not be a concern, as the next step will take care of getting this right.

Step 6: Monitor the performance of the new environment using the same methods as in step 3, but now in the new virtual environment. This again needs to be done using both methods to ensure all performance criteria are being fulfilled. At this stage we might find that system owners or users have higher or lower expectations of the new system. This can be a cause for concern, but I always suggest that we set aside these subjective impressions for now and focus on the technical work of baselining performance. The goal here should be at least parity with the existing system, unless we already know that a very different performance characteristic is being demanded; we will take care of that in the next step.
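
Once both baselines exist, the parity check can be as simple as comparing the two sets of samples against an agreed tolerance. A sketch, assuming each baseline was exported to a CSV file with a "seconds" column (the file names and the 5% tolerance are illustrative only):

```python
# Compare old and new response-time baselines against an agreed
# parity tolerance. Each CSV has one column named "seconds".
import csv
import statistics

def load(path):
    with open(path, newline="") as f:
        return [float(row["seconds"]) for row in csv.DictReader(f)]

old = load("physical_response_times.csv")
new = load("virtual_response_times.csv")

old_median = statistics.median(old)
new_median = statistics.median(new)
change_pct = (new_median - old_median) / old_median * 100

print(f"physical median: {old_median:.3f}s, virtual median: {new_median:.3f}s")
print(f"change: {change_pct:+.1f}%")

TOLERANCE_PCT = 5  # set this threshold with the business owner
if change_pct <= TOLERANCE_PCT:
    print("within agreed parity tolerance")
else:
    print("slower than tolerance; investigate before sign-off")
```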

Step 7: Identify new requirements: based on the system we have created in step 6, we now have the opportunity to either increase or decrease the system's performance. This is a very subjective part of the new system, and I think it should require documented evidence of existing performance and a rationale for the new performance characteristics. Often the driver is an aging system, or a system that was underutilized and will now serve more users.

Step 8: Make changes to the system to fit the new requirements. This will be quite iterative and will require a lot of testing until the system satisfies the existing or new requirements. This is the stage where we need to obtain sign-off once all owners are comfortable with what we have achieved. As part of that sign-off, we need to ensure performance is baselined with a tool that can be used again in the future should owners complain of performance issues.

We very rarely hear complaints about good performance, but we regularly hear people say that the new environment is much slower than the old system, whether that is fact or perception. By following these steps we should be in a good position to check, and hopefully refute, such claims at any time in the future life of the system. Taking the time to do this will save you endless trouble when the inevitable complaint comes in about the performance of the system.

There is of course much more to this. I have only touched on what to do if there will be an increase in system usage, and this can be a major issue, one that deserves a blog post all of its own, as it changes our approach from relying on a baseline of the existing system to trying to ascertain the growth pattern and its requirements. A very tricky situation, and one I hope to cover in the future.

In the meantime, don't treat a system migration project as simply business as usual; take the time to baseline and manage the performance of the new virtualized system to ensure you have delighted customers.

 

My thoughts during a DR event

As a TAM, one of the most important tasks I get involved in is proactive planning for a DR event. This has become quite important in the past few years for many of my customers as their virtual server numbers push past their physical server numbers and they reach a tipping point where they have more VMs than physical servers. VMware of course has an awesome solution for the VMware environment, SRM, but this post is not just about SRM; I want to look a bit deeper into what DR and BC actually require from a planning perspective.

Let's take a typical event that I experienced recently. A customer of mine had a complete failure at their main data center. Like most of my customers, they have a mixture of physical and virtual servers in the data center, the majority virtual. Their virtual servers are protected by SRM. For now they choose to protect only critical systems rather than everything, due to cost concerns around replicating large volumes of data; this will change when they move to SRM 5 with vSphere Replication as an alternative. This seems to be fairly typical, and definitely a point to note as SRM 5 and vSphere Replication increase our customers' options for DR using SRM.

Taking it a step further, we need to look at the rest of the environment and how it is prepared for a DR event. Setting aside the networking requirements, and assuming they will be made available at the DR site as required, we need to analyse the physical systems. What seems normal among customers that are over 60% virtualised is that many of their most important workloads may still not be virtualised. These may include databases, directory services and other extremely important systems. What is interesting when they are analysed is that these systems are often dependencies of systems that are already virtualised.

So what do we have here? A number of observations thus far. Firstly, in a highly virtualised environment it is fairly easy to protect the virtualised part using SRM, but most often only a subset is chosen. This subset comprises the most critical of the virtualised systems, and they often have dependencies on other systems that are still physical. These systems are where the issue comes in.

If we have a DR event it is fairly easy to fail over to the DR site with SRM (a posting for another time on how this works). The issue, however, is what happens to the dependencies on physical systems on the other side. And what about desktops; how are they maintained in a complete outage? Are all of these dependencies mapped out and available in the DR site as part of the DR plan? How do we know when they are up and ready to accept communication with the virtualised environment?

This then leads to the question: is our virtualised environment more available than the physical environment on which it may still depend for some of its resources? And if it is not, are we even able to fail over to the DR site, and is there any value in doing so?

When an event of catastrophic proportions occurs we don't have time to mess around and fiddle with systems, trying to figure out what is virtualised, what is physical and how the failover will take place at the DR site. What seems to happen most often is that, because of all these questions and potential issues, the virtual environment is not simply failed over as it should be, and all that occurs is downtime at the primary site until the event is over.

This seems like a major waste; to me it is having DR just to tick a box and appease the corporate auditors.

But really we need to do more.

I would love for all of my customers to virtualise 100%, but this might not always be practical for a variety of technical or political reasons, so it is best to assume that a proportion of the systems will still be physical. Here are some of the steps I would consider in a DR plan.

Step 1 is to have a solution in place that automatically identifies all of the systems in your organisation and their dependencies. Mapping this out and keeping it up to date automatically is key. This could also feed into your CMDB for consistency.

Step 2 is to identify the services you wish to protect in the case of a DR event. Identifying services rather than individual systems forces you to focus on the components that make up each service and therefore need to be available for it to operate.

Step 3 is to identify exactly which systems are physical and which are virtual, based on the services map, and put both into the DR plan. The dependencies will dictate the run order and any special requirements at the DR site when an event takes place.
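
To illustrate how dependencies dictate the run order, here is a small Python sketch that derives a start-up sequence from a dependency map using a topological sort. The systems and dependencies below are invented examples; in practice the map would come from your discovery tool or CMDB:

```python
# Derive a DR start-up run order from a dependency map.
# Each entry maps a system to the set of systems it depends on,
# so every system is brought up only after its dependencies.
from graphlib import TopologicalSorter  # Python 3.9+

dependencies = {
    "directory-services (physical)": set(),
    "database-cluster (physical)": {"directory-services (physical)"},
    "app-server (virtual, SRM)": {"database-cluster (physical)",
                                  "directory-services (physical)"},
    "web-front-end (virtual, SRM)": {"app-server (virtual, SRM)"},
}

run_order = list(TopologicalSorter(dependencies).static_order())
for step, system in enumerate(run_order, start=1):
    print(f"step {step}: bring up {system}")
```

Laying the plan out this way also makes the physical dependencies impossible to overlook, since they land at the front of the run order, before any SRM-protected system.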

You also need to pay careful attention to desktops, as they are the entry point to most systems. How will they access the DR site? Will they have connectivity to it? Are they available at the DR site? Are they physical endpoints with a fat client, or thin-client desktops with a centralised source that maintains their state so users can connect and continue as before?

Finally, if you are protecting an environment that is complex and includes both physical and virtual systems, which I suspect most are, you need to perform regular functional testing that includes full reporting and testing at the DR site. The SRM test may not be good enough in these cases, as it may only prove that your virtual environment can successfully fail over.
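
A functional test at the DR site can start with something as basic as confirming that each dependency actually accepts connections, and reporting the result. A minimal sketch, with placeholder host names and ports:

```python
# Check that each DR-site dependency accepts TCP connections and
# print a simple pass/fail report. Hosts and ports are placeholders.
import socket

checks = [
    ("directory-services.dr.example.internal", 389),
    ("database-cluster.dr.example.internal", 1433),
    ("app-server.dr.example.internal", 443),
]

for host, port in checks:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"OK    {host}:{port}")
    except OSError as err:
        print(f"FAIL  {host}:{port} ({err})")
```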

I would love all of my customers to simply virtualise all of their systems; that way we could have a much simpler plan. But while this is not a reality for most at the moment, we need to be pragmatic in how we approach a DR event, ensuring that no required system is left behind for any service we are protecting.

I would love to hear your comments on how this has gone for you if you have suffered such an event.