Approximating staffing ratios in a cloud organization as a logarithmic function of infrastructure metrics.
Customers who want to establish true cloud services based on VMware’s SDDC solution (or any other provider’s, for that matter) soon realize that, in order to fully leverage the technology, they need to adapt their IT organization.
More specifically, they need to set up a dedicated team, a cloud Center of Excellence (COE), to manage and operate their cloud services.
The structure and roles of that team are described in detail in ‘Organizing for the Cloud’.
During practically all Operations Transformation projects, a frequently asked question is: what is the optimum staffing level, in FTE (Full Time Equivalent), to set up this cloud organization?
The standard consultant answer is, of course, ‘it depends’. But in this blog, I will explain in more detail what ‘it depends’ means in this context.
In an earlier blog, I described “10 key factors to estimate staffing ratios to operate platforms with vRealize Automation and vRealize Operations Manager”:
- Number of lines of business
- Number of data centers
- Level of staff skill/experience
- Number of cloud services
- Workflow complexity
- Internal process complexity (includes support requirements, e.g. 5-day or 24×7 coverage)
- Number of third-party integrations
- Rate of change
- Number of VMs
- Number of user dashboards/reports
Now these 10 factors, and probably hundreds of others, will determine the complexity of the tasks that the cloud organization needs to perform and, therefore, the staffing level. Clearly there are thousands of possible combinations of these factors. But if we want to see how the FTE count evolves with a single, easy-to-quantify parameter (such as the number of virtual machines or any other ‘simple’ infrastructure metric), we need to make strict assumptions to ‘tie down’ the other factors.
So let’s assume that we are looking at a single organization evolving over time; as time passes, the number of virtual machines gradually increases, but so do the number and complexity of the services, as well as the demand for support coverage:
- Between 1 and 100 VMs, the COE is running as a pilot: there are no support requirements and only a small number of services to run.
- Between 100 and 1,000 VMs, the COE is running cloud services regionally with some basic service levels.
- Over, say, 30,000 VMs, the COE is running a global operation with a 24/7 support requirement and a broad range of services.
Practical observation of a number of real-life examples suggests an evolution broadly similar to the logarithmic curve in figure 1. Now this is still a model that deliberately simplifies and ‘smooths out’ the FTE curve, but there are two practical implications:
- Staffing levels may rise most steeply at the beginning of the curve. When the organization transitions from a pilot to a fully operating COE, staffing needs rise significantly.
- The FTE curve flattens out when the organization matures and can handle high volumes. Once the COE is operating with a high level of automation and experienced staff, adding workload requires only a marginal increase in the FTE count.
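To make the shape of the curve concrete, here is a minimal Python sketch of a logarithmic staffing model. The function names and both coefficients (`BASE_FTE`, `FTE_PER_DECADE`) are hypothetical values chosen purely for illustration; they are not derived from the observations mentioned above and would need to be calibrated against your own organization.

```python
import math

# Hypothetical coefficients, for illustration only (not taken from real data).
BASE_FTE = 1.0        # assumed staffing for a single-VM pilot
FTE_PER_DECADE = 3.0  # assumed FTE added for each tenfold increase in VMs


def estimate_fte(vm_count: int) -> float:
    """Smoothed estimate: BASE_FTE + FTE_PER_DECADE * log10(vm_count)."""
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    return BASE_FTE + FTE_PER_DECADE * math.log10(vm_count)


def fte_per_added_vm(low: int, high: int) -> float:
    """Average FTE added per extra VM when growing from low to high VMs."""
    return (estimate_fte(high) - estimate_fte(low)) / (high - low)


if __name__ == "__main__":
    # Print the smoothed estimate at a few volumes.
    for vms in (1, 10, 100, 1_000, 10_000, 30_000):
        print(f"{vms:>6} VMs -> {estimate_fte(vms):.1f} FTE")
```

With a model of this form, each tenfold increase in VMs adds the same number of FTEs, so the staffing cost per additional VM is far higher when growing from 10 to 100 VMs than when growing from 1,000 to 10,000. That is the steep start and the flattening tail described in the two points above.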
In reality, of course, the complexity (i.e. the demand on FTE) never grows quite so smoothly.
We would see threshold effects. For example, when we reach 300 workloads, a new 24×7 service may be added to the portfolio, which requires a rapid increase in FTE.
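A threshold effect can be sketched as a step added on top of the smooth curve. In the Python example below, the 300-workload trigger echoes the example just given; everything else (the `Threshold` type, the function names, the size of the step and the curve coefficients) is a hypothetical assumption for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    label: str
    trigger_workloads: int  # workload count at which the new demand appears
    extra_fte: float        # assumed one-off staffing step (illustrative)


# The 300-workload trigger mirrors the example in the text;
# the 4.0 FTE step is a made-up figure.
THRESHOLDS = [Threshold("new 24x7 service", 300, 4.0)]


def smooth_fte(workloads: int, base: float = 1.0, per_decade: float = 3.0) -> float:
    """Smoothed logarithmic estimate with hypothetical coefficients."""
    return base + per_decade * math.log10(workloads)


def fte_with_thresholds(workloads: int, thresholds=THRESHOLDS) -> float:
    """Smooth curve plus a step for every threshold already crossed."""
    steps = sum(t.extra_fte for t in thresholds
                if workloads >= t.trigger_workloads)
    return smooth_fte(workloads) + steps


if __name__ == "__main__":
    # Show the jump in staffing as the threshold is crossed.
    for w in (100, 299, 300, 1_000):
        print(f"{w:>5} workloads -> {fte_with_thresholds(w):.1f} FTE")
```

Here the step is tied to a workload count for simplicity. In practice the trigger could just as well be a date or a new requirement, so the same structure would work with a different trigger condition.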
In summary:
- The fastest rise in FTE will occur in the early stages of the build-up of cloud services; this is ‘normal’, given that the number of services and the service levels increase together, which significantly increases the demands on the cloud organization.
- Once cloud services are well established and automated, the FTE level should increase only marginally with rising infrastructure volumes: your organization will have learned to cope with increasing quantities.
- One caveat: although the FTE curve may look broadly logarithmic, threshold effects are inevitable. New demands on service levels (e.g. new compliance requirements, 24×7 support) can create an ‘uptick’ in FTE without necessarily a prior ‘uptick’ in volumes.
What we have presented here is an intuitive model for understanding how increasing volumes impact FTE. You are welcome to share your experience and perhaps refine this heuristic model.
Pierre Moncassin is an operations architect with the VMware Operations Transformation global practice and is based in the UK.