By Michael Francis
Continuing on from Ahmed’s recent blog on DevOps, I thought I would share an experience I had with a customer regarding performance management for development teams.
I was working with an organization that is essentially an independent software vendor (ISV) in a specific vertical; their business is writing software in the gambling sector and, in some cases, hosting that software to deliver services to their partners. It is a very large revenue stream for them, and their development expertise and software functionality are their differentiation.
Because of historical stability issues and a lack of trust between the application development teams and the infrastructure team, the organization had brought in a new VP of Infrastructure and an Infrastructure Chief Architect several years earlier. They focused on changing the process and culture, and also on aligning the people. They took our technology and implemented an architecture that aligned with our best practices, with the primary aim of delivering a stable, predictable platform.
This transformation of people/process and technology provided a stable infrastructure platform that soon improved the trust and credibility of the infrastructure team with the applications development teams for their test and development requirements.
The applications team in this organization, as you would expect, carries significant influence. Even though the applications team had come to trust virtual infrastructure for test and development, they still had reservations about a private cloud model for production. Their applications placed significant demands on infrastructure and had to guarantee a rate of transactions per second committed across multiple databases; any latency could cause significant processing issues and, therefore, loss of revenue. Visibility across the stack was a concern.
The applications team responsible for this critical in-house developed application designed it to instrument its own performance by writing out flat files on each server, containing transaction commit times and other application-specific performance information.
Irrespective of complete stack visibility, the applications team was challenged with how to monitor this custom distributed application's performance data from a central point. The team also wanted some means of understanding normal performance data levels, as well as a way to gain insight into the stack to see where any abnormality originated.
Due to the trust that had developed with the infrastructure team, they engaged with them to determine whether the infrastructure team had any capability to support their performance monitoring needs.
The infrastructure team was just beginning to review their needs for performance and capacity management tools for their Private Cloud. The team had implemented a proof-of-concept of vCenter Operations Manager and found its visualizations useful, so they asked us to work with the applications team to determine whether we could ingest this custom performance information.
We started by educating them on the concept of a dynamic learning monitoring system. It had to allow hard thresholds to be set but also, more importantly, determine the spectrum of normal behavior for an application based upon data pattern prediction algorithms, both for the application as a whole and for each of its individual components.
We discussed the benefits of a data analytics system that could take a stream of data and, irrespective of the data source, create a monitored object from it. The data analytics system had to be able to assign the data elements in the stream to metrics, start determining normality, provide a comparison to any hard thresholds, and provide the visualization.
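To make the distinction between a hard threshold and a learned range of normal behavior concrete, here is a minimal Python sketch. It is not the algorithm vCenter Operations Manager uses; it substitutes a deliberately simple stand-in (mean plus or minus k standard deviations of past samples), and the names `normal_band`, `evaluate`, `hard_max` and `k` are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev


def normal_band(history, k=3.0):
    """Estimate a (low, high) range of normal values from past samples.

    Assumption: 'normal' is approximated as mean +/- k sample standard
    deviations. A real analytics engine would use far richer models.
    """
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate spread")
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def evaluate(sample, history, hard_max=None, k=3.0):
    """Check one new sample against the learned band and an optional hard limit."""
    low, high = normal_band(history, k)
    return {
        # Outside the learned band: unusual for this metric, even if "safe".
        "abnormal": not (low <= sample <= high),
        # Above the fixed limit: a breach regardless of what is normal.
        "hard_breach": hard_max is not None and sample > hard_max,
    }
```

With a history of commit times hovering around 10 ms, a sample of 25 ms would be flagged as abnormal even though it sits well under a hard limit of 50 ms, which is the kind of early signal a fixed threshold alone would miss.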
The applications team was keen to investigate and so our proof-of-concept expanded to include the custom performance data from this in-house developed application.
The screenshot below shows VMware vCenter Operations Manager. It shows the Resource Type screen, where we can define a custom Resource Type to represent the application itself and its application-specific metrics.
To get the data into vCenter Operations Manager we simply wrote a script that opened the flat file on each of the servers participating in the application, read the file, and then posted the information into vCenter Operations Manager using its HTTP POST adapter. This adapter provides the ability to post data from any endpoint that needs to be monitored, which makes vCenter Operations Manager a very flexible tool.
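The read-and-post flow described above can be sketched roughly as follows. This is an illustration in Python, not the script we actually used. The flat-file layout (`epoch_ms,metric_name,value` per line), the payload layout, and the function names are all assumptions made for the example; the real endpoint URL and body format expected by the HTTP POST adapter should be taken from the product documentation.

```python
import csv
import io
import urllib.request


def parse_flat_file(text):
    """Parse lines of 'epoch_ms,metric_name,value' (a hypothetical layout)."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        # Skip blank lines and '#' comment lines.
        if not record or record[0].startswith("#"):
            continue
        timestamp, name, value = record
        rows.append((int(timestamp), name.strip(), float(value)))
    return rows


def build_payload(resource_name, rows):
    """Build a plain-text body: resource name, then one metric per line.

    Assumption: this layout is illustrative only, not the adapter's format.
    """
    lines = [resource_name]
    lines += [f"{name},{timestamp},{value}" for timestamp, name, value in rows]
    return "\n".join(lines).encode("utf-8")


def post_metrics(url, payload, timeout=10):
    """POST the payload to the given URL and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping parsing and payload construction separate from the network call means the first two steps can be checked against sample files without a monitoring server being available.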
In this instance we posted into vCenter Operations Manager a combination of application-specific counters and Windows Management Instrumentation (WMI) counters from the Windows operating system platform the apps run on. This is shown in the following screenshot.
You can see the Resource Kind is something I called vbs_vcops_httpost, which is not a ‘standard’ monitored object in vCenter Operations Manager; the product has created this based on the data stream I was pumping into it. I just needed to tell vCenter Operations Manager what metrics it should monitor from the data stream – which you can see in the following screenshot.
For each attribute (metric) we can configure whether hard thresholds are used and whether vCenter Operations Manager should use that metric as an indicator of normality. We refer to the normality as dynamic thresholds.
Once we have identified which metrics we want to mark, vCenter Operations Manager can build spectrums of normality for them and have them affect the health of the application, which in turn allows us to create visualizations. The screenshot below shows an example of a simple visualization. It shows the applications team a round-trip time metric plotted over time, alongside a standard Windows WMI performance counter for CPU.
By introducing the capability to monitor custom in-house developed applications using a combination of application-specific custom metrics and standard guest operating system and platform metrics, we gave the DevOps team visibility into the health of the whole stack. This enables them to see the impact of code changes against different layers of the stack, so they can compare the before and after from the perspective of the spectrum of normality for varying key metrics.
From a cultural perspective, this capability brought the applications development team and infrastructure team onto the same page; both teams gain an appreciation of any performance issues through a common view.
In my team we have developed services that enable our customers to adopt and mature a performance and capacity management capability for the hybrid cloud, which, in my view, is one of the most challenging considerations for hybrid cloud adoption.
Michael Francis is a Principal Systems Engineer at VMware, based in Brisbane.