During my discussions with multiple customers over the last few months, one major point kept coming up: whether to choose a Distributed or a Simple installation architecture. More often than not, I have seen customers choose a deployment model not based on careful consideration but on general belief. This blog tries to shed light on the considerations that should go into the decision between Distributed and Simple Architecture. Any Solution Architect or Virtualization Architect should find it relevant to that discussion.
The general belief:
I have met enough customers to safely say that more than 90% of the time, a customer asks us for a Distributed Architecture irrespective of the size of the environment or the total uptime required from the solution. I agree that it undoubtedly has its own benefits, such as removing a single point of failure. But the question is: should we always choose a Distributed Architecture simply because we can, or for that stated reason, or are there other careful considerations involved?
My point is: it depends, and you would be surprised how often I recommend a simple install over a distributed environment.
Before you dismiss this as baseless, let us look at my reasons for saying so.
First point of consideration:
In most cases, the management components do not directly affect running workloads, so downtime of the management components should not have a direct impact on your existing environment. During that time you will not be able to manage the environment or deploy anything new, but your current SLA for already running workloads is not affected. This is fine in most cases; however, if you run a public cloud and the main management portal goes down, the impact on the business is obviously bigger and should be avoided. This fact is particularly prominent in VMware environments. For example, consider the following situations.
- vCenter or PSC (or both) goes down – existing VMs keep running, but no new deployment or management is possible. With vRA on top of vCenter, no new deployment at the cloud level is possible either, although the cloud portal keeps working. Features like HA and distributed networking all continue to work. Remember, this is your main management component in a VMware virtualization environment.
- vRealize Automation components go down – the cloud portal is unavailable, but existing workloads keep running. You can still SSH or RDP into the VMs hosted in the cloud; end-user operations are not hampered.
I am considering only these two cases because they are responsible for the virtualized and cloud environments respectively. From the above, I can safely assume that availability of the management components is important, but it does not immediately affect already running workloads.
Second point of consideration:
Next, let's explore both deployment architectures and their general effects more closely.
Distributed Architecture: First, let's check the implications of a Distributed Architecture more closely.
Most of the time, this architecture is chosen for the following two reasons.
- To remove a single point of failure (increase availability)
- To support a larger environment (if a single node can support, say, 10,000 elements, then 2 nodes will support 20,000 elements – load balancing)
A lot of the time, point two does not apply. In very few cases will you find a customer exceeding the technical limitations of a product.
For example, how many times have you actually seen a single vCenter Server supporting 1,000 ESXi hosts and 15,000 powered-off VMs in production? Or, for that matter, a single vCenter appliance taking care of 10,000 powered-on VMs? I am yet to see one. Have you ever seen a single ESXi host running 1,024 VMs or 4,096 vCPUs? Have you ever seen a customer actually touching or even nearing the technical limits of a VMware product? I doubt it, and would love to see one.
That said, if you do have an environment this big, then Distributed Architecture is definitely THE way for you.
Coming back to the point: it seems the majority of the time, a Distributed Architecture is chosen to remove a single point of failure and thus increase availability.
So let's consider a fully distributed architecture for a cloud environment built on vRealize Automation and see the effects it has on the environment.
For a fully distributed architecture of vRealize Automation, we need the following number of components:
- Deployed vRA appliance – 2+
- IaaS web server – 2+
- IaaS Manager Service Server – 2+
- IaaS DEM Server – 1+
- Agent Server – 1+
The number beside the component denotes the minimum number of nodes required for Distributed Architecture.
So a minimum of 8 servers is required for vRealize Automation alone (for HA of the DEM and Agent roles you need more nodes, or you overlap the roles). For the database you also need the following.
- MSSQL Server in HA mode – 2+
On top of that, if you consider a distributed vCenter environment, you have the following requirements:
- PSC – 2+
- vCenter – 2+
So a total of 14+ VMs. Of course, I am stating the extreme case here; in all probability, an actual production environment will have fewer servers with overlapping roles. But if you have a really big environment, this is the number.
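As a quick sanity check on the arithmetic, the node counts above can be tallied in a few lines of Python. This is a minimal sketch using the minimum counts only; HA for the DEM and Agent roles would push the figure higher.

```python
# Minimum node counts for a fully distributed vRA + vCenter deployment,
# as listed above (DEM and Agent shown at their bare minimum of 1 each).
vra_nodes = {
    "vRA appliance": 2,
    "IaaS web server": 2,
    "IaaS Manager Service server": 2,
    "IaaS DEM server": 1,
    "Agent server": 1,
}
db_nodes = {"MSSQL Server (HA)": 2}
vcenter_nodes = {"PSC": 2, "vCenter": 2}

vra_total = sum(vra_nodes.values())  # the "8+" figure for vRA alone
total = vra_total + sum(db_nodes.values()) + sum(vcenter_nodes.values())
print(vra_total, total)  # 8 14
```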
All these components will have a load balancer in front. So architecturally, the vCenter environment looks like the following:
Or, more precisely and in more detail:
And the vRealize Automation environment should be as given below:
The direct effect:
The placement of the load balancer in this architecture has a significant effect on the environment. Let's consider a physical load balancer in a traditional environment, i.e. somewhere upstream behind the firewall (at least 2 or 3 hops away from the host on which the VM resides).
Now, let's check how a normal user request for a VM is handled. The request comes to the front load balancer (LB) and, based on its decision, goes to the respective vRealize Automation appliance. From there it goes out to the LB again and comes back to an IaaS web server. Next, the request goes out to the LB once more, a Manager Service server is chosen, and finally it reaches a DEM. The same story applies when the VM creation request goes out to vCenter: it reaches the LB to choose a PSC node and then again for a vCenter node.
Considering all these round trips to the LB, think how many extra hops take place simply because of the nature of the deployment architecture, and what effect they have on the overall response time.
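To get a rough feel for the cumulative cost, here is a back-of-the-envelope sketch. The per-hop latency figure and the hop count are purely illustrative assumptions, not measured values for any real LB or product.

```python
# Assumed round-trip cost of one traversal to an upstream physical LB.
# Hypothetical figure for illustration only.
LB_HOP_MS = 2.0

# LB traversals in the distributed request path described above:
# vRA appliance -> IaaS web -> Manager Service -> PSC -> vCenter
DISTRIBUTED_LB_HOPS = 5
SIMPLE_LB_HOPS = 0  # single node per component, no LB in the path

extra_latency_ms = (DISTRIBUTED_LB_HOPS - SIMPLE_LB_HOPS) * LB_HOP_MS
print(extra_latency_ms)  # 10.0 ms of added overhead per request in this sketch
```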
Simple deployment architecture:
Now let's consider the effects of a simple deployment architecture. For this discussion, assume the number of supported elements is well within the capability of a single node. In all probability, a simple architecture will have a single node for every component of the solution.
So a request does not have to make all those round trips to the LB. For obvious reasons, response time should be lower than in a fully distributed architecture, so you get better performance.
But the flip side is that you now have a single point of failure. So let's consider the different availability options to increase the overall uptime of a simple single-node deployment architecture.
- The first line of defense is the hypervisor's underlying High Availability with the VM monitoring option. Typically, a physical node failure is detected within about 12 seconds and a restart is initiated within 15 seconds. For the sake of discussion, let's say the OS and application of the VM come up within 5 minutes. Assuming a node failover happens once a month, total downtime is 5 minutes out of 43,200 minutes (a 30-day month), which gives an uptime of 99.988%. The same applies to a hung VM or hung application, since we are monitoring at the VM level as well.
- The second line of defense is snapshots: if the OS or application gets corrupted, we simply revert to a snapshot. If an external database is used, there is not much state in the VM itself, so reverting a snapshot is sufficient; say it takes 20 seconds, giving an uptime of 99.999%. But if an internal database is used, simply reverting to an earlier snapshot is not enough: we revert the snapshot to recover the OS and then restore the database from backup (which requires a regular database backup mechanism). This takes longer, say 10 minutes, in which case your uptime is 99.977%.
- The third line of defense is backups. If everything gets corrupted, you need to restore the entire appliance from a backup, which takes, say, 30 minutes. In this case you get 99.931% uptime in a month.
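The uptime figures above are simple arithmetic; here is a minimal Python sketch using the same illustrative downtime assumptions (one incident per 30-day month):

```python
def monthly_uptime_pct(downtime_minutes, month_minutes=30 * 24 * 60):
    """Uptime percentage for a single outage of the given length in a 30-day month."""
    return 100 * (1 - downtime_minutes / month_minutes)

print(round(monthly_uptime_pct(5), 3))        # HA restart (5 min): 99.988
print(round(monthly_uptime_pct(20 / 60), 3))  # snapshot revert (20 s): 99.999
print(round(monthly_uptime_pct(10), 3))       # snapshot + DB restore (10 min): 99.977
print(round(monthly_uptime_pct(30), 3))       # full restore from backup (30 min): 99.931
```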
So the final choice comes down to required uptime. If the business can sustain 99.931% uptime for the management components (the worst-case scenario) and the total supported elements are well within the product limits, then I will certainly suggest a simple install, for the following reasons:
- Simpler to manage
- Simpler to update
- Will perform better (as compared to full distributed environment)
- Better response time
- Less complex
In the end, I would say: do not choose a fully distributed architecture simply because you can. Consider all the points above. Choosing a simple single-node deployment architecture is not so bad after all.
Another point to note: if I do need to build a fully distributed environment, I would prefer a virtual load balancer like NSX Edge, which sits much closer to the VMs than a physical one in a traditional architecture, thus reducing round-trip time.
I am simplifying an already complex topic, and the final answer is: it all depends. Every environment and requirement is different and there is no single rule to follow, but do not discard a simple deployment architecture because of the "so-called reasons". Consider it seriously; it may be far better for your environment than a distributed architecture. Until then, happy designing, and let me know your viewpoints.
Note: the above discussion is from a virtualization/cloud perspective. It does not apply to a traditional physical datacenter, where recovery time for a physical server failure is much higher and the SLA cannot be ensured in the same way.