VMware vRealize Operations Manager 6.0 introduces a robust, scalable, and resilient platform architecture. In this post, we will talk about the platform design that makes this massive scalability possible. But before we jump into scalability, let's look at the overall design and form factor of vRealize Operations.
Platform Design
It's a completely new platform and a new architecture. In this release, VMware has moved from a two-VM vApp to a single-VM virtual appliance. As part of this redesign, VMware has created a new service-based design: capacity, performance, and all plug-ins are now services that run on top of common services in the platform. This one virtual appliance contains all the services, and there are five different databases present in a node. For scalability, multiple virtual appliances can be joined into a single deployment, which also provides robust high availability (HA).
Form Factor
The main problems being solved are scalability, resiliency, and extensibility. You can see this in the form factor difference between the earlier release and this release.
| Version 5.x | Version 6.0 |
| --- | --- |
| vApp that contains two virtual machines: an Analytics VM and a UI VM | A single virtual appliance as a unified platform; more appliances can be added for HA and scalability |
| Designed to scale up | Designed to scale out |
| License management through vCenter Server | License management independent of vCenter Server |
| Installable remote collector only | The same virtual machine can be turned into a remote collector |
| Bring-your-own-database (BYODB) model | Built-in databases |
Each install of the software includes the entire stack. Think of the grey box here as a virtual appliance or a physical computer; it represents one complete stack of the software. The product disciplines work across this stack: the common services are available, and the rest of the product disciplines can be enabled on top of them.
*HIS = Historical Inventory Service.
Highly Scalable Architecture
This new architecture scales out horizontally to support more objects, metrics, and concurrent users. From a deployment perspective, we want to hide the complexities of scaling out from the user, so the whole stack is deployed at a time. When one instance (slice) of the stack runs out of capacity (CPU, disk, or memory), another can be spun up to add capacity, and this can be repeated as necessary to handle the scale.
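To make that trigger concrete, here is a minimal sketch of the scale-out decision described above. The thresholds and the node-statistics shape are hypothetical illustrations, not part of any vRealize Operations API:

```python
# Illustrative sketch only -- thresholds and helpers are hypothetical,
# not part of the vRealize Operations Manager API.

CPU_LIMIT = 0.80      # spin up a new node past 80% CPU
DISK_LIMIT = 0.85     # ...or past 85% disk usage
MEM_LIMIT = 0.80      # ...or past 80% memory usage

def needs_new_node(node_stats: dict) -> bool:
    """Return True when one slice of the stack is out of capacity."""
    return (node_stats["cpu"] > CPU_LIMIT
            or node_stats["disk"] > DISK_LIMIT
            or node_stats["memory"] > MEM_LIMIT)

# The whole stack is deployed at once, so scaling out is just adding
# another identical slice to the cluster.
cluster = [{"cpu": 0.91, "disk": 0.40, "memory": 0.76}]
if any(needs_new_node(n) for n in cluster):
    cluster.append({"cpu": 0.0, "disk": 0.0, "memory": 0.0})  # new data node
print(f"cluster size: {len(cluster)} node(s)")
```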
This is a major change, yet the core analytics engine and the core capacity engine remain; they are combined and made into a scalable cluster. The design goal was not to rewrite everything, but to change the persistence layers to be elastically scalable with built-in application availability.
You deploy vRealize Operations Manager as a cluster of one or more nodes. Each node in the cluster takes on a particular role: master, master replica, data, or remote collector. In this way, the cluster provides high availability (HA) against host and node failures.
Master Role
A single-node cluster contains only the master node, which, as its name implies, manages the cluster, including its high availability. Both administration and data are located on the master node.
A multiple-node cluster includes one master node and one or more data nodes. The master node must be online before you configure any new nodes, and it must be online before other nodes are brought online. The master node can also function as the NTP server for the entire cluster; however, the best practice is to use an external NTP source that the master node synchronizes with.
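A minimal sketch of that bring-up order is below. The `/admin` health check is an assumption for illustration, not a documented vRealize Operations endpoint, and the node addresses are placeholders:

```python
# Sketch of the bring-up order only; the admin endpoint below is an
# assumption for illustration, not a documented vRealize Operations URL.
import requests

MASTER = "https://vrops-master.example.com"
DATA_NODES = ["https://vrops-data-01.example.com"]

def is_online(node_url: str) -> bool:
    try:
        # Hypothetical health check -- replace with your environment's probe.
        resp = requests.get(f"{node_url}/admin", verify=False, timeout=10)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# The master must be online before any other node is configured or started.
assert is_online(MASTER), "Bring the master node online first"
for node in DATA_NODES:
    print(f"safe to bring {node} online")
```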
Master Replica Role
vRealize Operations Manager supports high availability by enabling a master replica node for the master node. The replica node receives redundant copies of the data sent to the master node.
When a problem occurs with the master node, failover to the replica node is automatic and requires only two to three minutes of downtime. Data stored on the master node is always 100% backed up on the replica node. To enable HA, you must deploy at least one more node in addition to the master node. When you deploy the nodes as virtual machines, place the replica node on different hardware than the master node so that the replica node is physically redundant.
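One way to enforce that physical separation in vSphere is a DRS anti-affinity rule between the two VMs. Below is a sketch using pyVmomi; the vCenter address, credentials, and VM/cluster names are placeholders for your environment:

```python
# Sketch: keep the master and replica VMs on different ESXi hosts with a
# DRS anti-affinity rule. Names and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local", pwd="password",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

def find(vimtype, name):
    """Return the first inventory object of the given type with this name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.Destroy()

cluster = find(vim.ClusterComputeResource, "Compute-Cluster")
master = find(vim.VirtualMachine, "vrops-master")
replica = find(vim.VirtualMachine, "vrops-replica")

# Anti-affinity rule: DRS keeps these two VMs on separate hosts.
rule = vim.cluster.AntiAffinityRuleSpec(
    name="vrops-master-replica-separation", enabled=True,
    vm=[master, replica])
spec = vim.cluster.ConfigSpecEx(
    rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)])
cluster.ReconfigureComputeResource_Task(spec, modify=True)
Disconnect(si)
```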
Data Role
Data nodes are additional cluster nodes that let you scale out vRealize Operations Manager to monitor larger environments. They share the load of performing vRealize Operations Manager analysis and carry the core functionality of collecting and processing incoming data.
Remote Collector Role
A remote collector node is an additional cluster node that allows vRealize Operations Manager to gather more objects into its inventory for monitoring. Unlike data, master, and master replica nodes, remote collector nodes include only the collector role of vRealize Operations Manager; they neither store data nor process any analytics functions. Formatting of data is handled by the adapter on the collector.
Use cases for deploying a Remote Collector node:
- Navigate firewalls
- Reduce bandwidth across data centers (see the sketch after this list)
- Connect to remote data sources (over the WAN)
- Reduce the load on the vRealize Operations Manager analytics cluster.
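As a rough illustration of the bandwidth point, the sketch below shows the idea behind a collector: gather data locally, then forward one batched, compressed payload per interval instead of many chatty calls. It is conceptual only, not vRealize Operations code, and the metric shape is made up:

```python
# Conceptual sketch only: why a remote collector cuts WAN traffic.
# It batches and compresses locally collected metrics before forwarding
# them to the analytics cluster, instead of many small chatty calls.
import json
import zlib

def collect_metrics() -> list:
    # Placeholder for an adapter gathering data from a remote data source.
    return [{"object": f"vm-{i:03d}", "cpu_usage": 0.5} for i in range(1000)]

batch = json.dumps(collect_metrics()).encode()
compressed = zlib.compress(batch)  # one compressed payload per interval
print(f"raw: {len(batch)} bytes, on the wire: {len(compressed)} bytes")
# The collector only formats and forwards; storage and analytics stay
# on the data/master nodes in the analytics cluster.
```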
Only one type of user interface runs on the remote collector node: the administration interface. The product user interface does not run on remote collector nodes. Remote collector nodes also do not participate in high availability; only the data, master, and master replica nodes do.
High Availability
You can provide high availability to your vRealize Operations Manager cluster at different levels.
You can provide data availability by implementing appropriate RAID levels on your storage arrays. This protects persisted data against loss from disk failures.
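As a quick illustration of the capacity trade-off behind those RAID choices, here is a toy calculation (simplified; real arrays vary by vendor and configuration):

```python
# Simplified RAID math for illustration; real arrays and vendors vary.
def usable_capacity(level: str, disks: int, disk_tb: float) -> float:
    if level == "RAID1":   # mirrored: half the raw capacity
        return disks * disk_tb / 2
    if level == "RAID5":   # one disk's worth of parity
        return (disks - 1) * disk_tb
    if level == "RAID6":   # two disks' worth of parity
        return (disks - 2) * disk_tb
    raise ValueError(level)

for level in ("RAID1", "RAID5", "RAID6"):
    print(level, usable_capacity(level, disks=8, disk_tb=2.0), "TB usable")
```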
The high availability feature in a vRealize Operations Manager cluster protects against system failures and keeps services available. The cluster can recover from a data node failure because redundant services allow simple redirects that keep the system running. It works like a web cluster: if one server is down, the load is transferred to another.
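Conceptually, the redirect works like the toy failover loop below. The node addresses are placeholders, and this is only an illustration of the pattern, not how the product is implemented internally:

```python
# Toy illustration of the redirect behavior described above: try each
# cluster node in turn and fail over when one does not answer.
import requests

NODES = ["https://vrops-node-01.example.com",
         "https://vrops-node-02.example.com"]

def fetch_with_failover(path: str) -> requests.Response:
    last_error = None
    for node in NODES:
        try:
            return requests.get(f"{node}{path}", verify=False, timeout=5)
        except requests.RequestException as err:
            last_error = err  # node down -- redirect load to the next one
    raise RuntimeError("all cluster nodes unavailable") from last_error
```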