Welcome to a “behind the curtains” article written by the vRealize Network Insight (vRNI) engineering team, highlighting the technical challenges around scaling the massive data lake that is vRealize Network Insight. We focus on the work that took place last year, when vRNI changed one of its critical database components. This post was written by Gaurav Agarwal, Rohit Reja, and Debraj Manna.
Motivation and Background
Until last year's versions, vRealize Network Insight (vRNI) used PostgreSQL to store its configuration time-series data. This is the most critical information the product holds: it captures datacenter discoveries and state changes. Discovered entities include Virtual Machines, firewall rules, switch ports, and flows; state changes for existing entities include security group membership and VM-to-host mapping updates. The configuration data is the source of truth used by the other system blocks, such as analytics, planning, and search.
While PostgreSQL is an extraordinarily hardened and battle-tested database, it is fundamentally a single-machine system. To scale a PostgreSQL database, the host machine needs to be scaled up with more CPU, RAM, and storage. In large cluster deployments, this places a disproportionately high storage, memory, and compute cost on the single machine hosting the PostgreSQL database, which makes it difficult to deploy vRNI at larger cluster sizes.
Finally, to achieve a transparent and robust Highly Available (HA) deployment of the vRNI platform cluster, we need the ability to transparently replicate and fail over Postgres. Replicating the data to two or more machines adds considerable storage overhead, and the automatic failover mechanisms available today require significant hand-holding; they are not suited to the zero-SRE, dark-deployment setting in which vRNI operates on customer premises.
It became clear that we needed to change the underlying storage layer in a more fundamental way: to something that natively supports distribution, replication, and HA, and is extremely resilient to faults.
Rewriting history
Rewrites of a core layer such as the store are infrequent and challenging exercises. They require significant planning and effort to ensure that the old and new versions of the system can co-exist and have API parity (to the extent possible). The exercise must not stretch on for too long, as keeping multiple versions in sync slows down development. At the same time, we had to be careful not to rush the new version out before it was tested well enough to catch unforeseen issues.
We used this opportunity to incorporate our learnings from running vRNI for the past five-plus years (since its Arkin days) and update the storage data model to be more robust and efficient. The new data model prevents inconsistencies by enforcing strict invariants that maintain relationship integrity even in rare boundary conditions. It heavily exploits the constructs provided by the new storage layer to make targeted use cases significantly more efficient and horizontally scalable.
Evaluation and Selection
We evaluated multiple technologies to either (a) make PostgreSQL replicated and highly available, or (b) replace PostgreSQL with a natively distributed storage layer. There are four classes of solutions available to solve this problem:
1. Database orchestrator systems that achieve an HA database by running multiple complete copies of the database and using a variant of a consensus protocol, such as Paxos or Raft, to elect a single master among the running database instances (e.g., https://github.com/zalando/patroni). This solves the HA problem but leaves the data-distribution and resource-imbalance issues unresolved.
2. Database clustering solutions that automatically shard the tables and manage writes and reads by routing data and queries appropriately (e.g., Vitess). These require the data to be cleanly splittable – for example, the customer and orders tables of an online store. For highly connected data models like datacenter graphs, such a split is generally not possible.
3. Distributed RDBMS systems that attempt to provide a single homogeneous view of the entire database without any user-level sharding. This is a simpler model for developers, as they do not need to split the data into hard silos upfront (e.g., CockroachDB, TiDB).
4. Distributed transactional key-value stores that do not provide higher-level abstractions like SQL and focus only on delivering high-throughput multi-row transactions with strong consistency guarantees (e.g., FoundationDB).
During the evaluation phase, we focused on the following critical aspects: open source, an active community, robustness with light operational overhead, scalability, and flexibility (developer friendliness).
We tried approaches (1), (2), and (3) but found serious issues that made each of them impractical for us. For instance, with approach (1) we found that the orchestrator frequently had race conditions in promoting and demoting master roles, leading to diverging commit timelines that required manual intervention and incurred downtime and data loss. Approach (2) was not viable for vRNI because our data model is a time series of a highly connected graph representing datacenter state relations and their evolution, and it could not be cleanly sharded into isolated stores. For the systems falling under approach (3), we found that most of them had unclear programming models or unproven/confusing guarantees around consistency and transaction isolation, and we also ran into scaling issues with only moderately sized clusters. As this field is relatively new in the open-source domain, some of the challenges we faced are not surprising, and we hope that these aspects will mature and improve with time.
Experiment, experiment, experiment!
After a lot of experimentation, we found approach (4) to be the most suitable for us (given the currently available frameworks), as it gives us enough flexibility to model the vRNI data graph on top of it. Moreover, these systems provide a simpler abstraction – key-values – which lets them focus on delivering better scale, robustness, and consistency guarantees, and leave the development of higher abstractions to their users. While this approach does require more work from users in carefully designing and implementing data models, it has proven to be the right approach for vRNI.
Of the open-source systems that provide a distributed key-value store abstraction, we found FoundationDB to be a very mature and robust system, which is an essential attribute for the dark environments in which vRNI operates. Additionally, thanks to the lower-level abstractions FoundationDB provides, the team behind it has been able to deliver very high throughput with the strongest consistency levels (external consistency at the serializable isolation level).
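To make this concrete, here is a minimal sketch of how a datacenter graph can be laid out on FoundationDB's ordered key-value pairs using the Python bindings. The subspace names, key layout, and API version below are hypothetical illustrations, not vRNI's actual schema:

```python
import fdb

fdb.api_version(630)   # assumes an FDB 6.3 client; use the version you run
db = fdb.open()        # uses the default cluster file

# Hypothetical key layout: entity rows and relationship edges as tuple-encoded keys.
entity = fdb.Subspace(('entity',))  # ('entity', entity_id) -> serialized properties
edge = fdb.Subspace(('edge',))      # ('edge', src_id, relation, dst_id) -> b''

@fdb.transactional
def add_vm_on_host(tr, vm_id, host_id, vm_props):
    # The VM row and its VM-to-host edge are written in one serializable transaction,
    # so a reader never sees a VM without its host relationship (or vice versa).
    tr[entity.pack((vm_id,))] = fdb.tuple.pack((vm_props,))
    tr[edge.pack((vm_id, 'RUNS_ON', host_id))] = b''

@fdb.transactional
def hosts_of_vm(tr, vm_id):
    # Because keys are ordered, "all hosts this VM runs on" is one contiguous range read.
    return [edge.unpack(k)[2] for k, _ in tr[edge.range((vm_id, 'RUNS_ON'))]]
```

Calling `add_vm_on_host(db, ...)` lets the `@fdb.transactional` decorator run the retry loop on conflicts, which is how the serializable isolation mentioned above surfaces to application code.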
There was a steep learning curve in transforming the data model to a key-value store. We had to think about many of the aspects that a typical RDBMS provides trivially out of the box: secondary indexes, large-value storage (TOAST in PostgreSQL terms), data-integrity constraints, and many others. We ran into multiple false starts that led to issues like hot-spotting and non-optimal IO patterns. We had to design the entire data model from the ground up on top of raw key-value rows. However, this forced us to think hard about our access patterns and how to optimize each of them individually. Earlier, with the black-box approach of using an RDBMS, we did not have enough control over data placement and access patterns, and therefore did not think very deeply about those aspects.
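Secondary indexes are a good example of this extra work: what `CREATE INDEX` gives you for free in PostgreSQL has to be modeled and kept in sync by hand. A minimal sketch of the pattern, again with a hypothetical key layout rather than vRNI's real one:

```python
import fdb

fdb.api_version(630)
db = fdb.open()

# Hypothetical layout: primary rows keyed by VM id, plus a secondary index by host.
vm_by_id = fdb.Subspace(('vm',))            # ('vm', vm_id) -> packed host_id
vm_by_host = fdb.Subspace(('vm_by_host',))  # ('vm_by_host', host_id, vm_id) -> b''

@fdb.transactional
def move_vm(tr, vm_id, new_host):
    # The primary row and the index entry are mutated in the same transaction, so the
    # index can never point at a stale host (integrity an RDBMS would enforce for us).
    old = tr[vm_by_id.pack((vm_id,))]
    if old.present():
        del tr[vm_by_host.pack((fdb.tuple.unpack(old)[0], vm_id))]
    tr[vm_by_id.pack((vm_id,))] = fdb.tuple.pack((new_host,))
    tr[vm_by_host.pack((new_host, vm_id))] = b''

@fdb.transactional
def vms_on_host(tr, host_id):
    # The index turns "all VMs on this host" into a single contiguous range read,
    # which also helps avoid the hot-spotting caused by poorly chosen key prefixes.
    return [vm_by_host.unpack(k)[1] for k, _ in tr[vm_by_host.range((host_id,))]]
```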
Benefits to customers
With this change, we have been able to remove all single points of failure from vRNI. In clustered deployments, every piece of data on the vRNI platform now has at least two copies on different nodes, which drastically reduces the possibility of data loss in the face of permanent node failures. Note that a seamless vRNI HA solution with transparent failover of all tasks is not yet complete, but this migration removes one of the biggest hurdles on the way there.
Additionally, with this change, all the platform nodes have balanced storage, processing, and memory requirements, eliminating the performance bottlenecks caused by a single resource-heavy node.
The new configuration store is designed using very low-level abstractions, giving us tight control over, and visibility into, the access patterns. We have revamped our data modeling to be significantly more robust and efficient. This lends itself to better internal developer APIs on which vRNI layers its higher-level use cases, resulting in an overall more efficient and scalable architecture. We are well positioned to make further use of the improved data placement model to optimize vRNI's data transformation pipelines in future versions.
Migration
Migration to the new database happens automatically as part of the vRNI 5.1 (or later) upgrade process. The upgrade process verifies that appropriate storage is available before starting the migration. Depending on the number of rows and the size of the PostgreSQL database being migrated, the migration may take from about four hours for a small database up to 48 hours for very large ones.
The process itself is robust to failures and will resume from the last automatic checkpoint if the vRNI services are restarted for any reason.
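As a rough illustration of the checkpointing idea (an illustrative sketch only, not the shipping migration code), progress can be persisted in the target store itself so that, after a restart, the migration simply re-reads the last committed checkpoint and continues from there:

```python
import fdb

fdb.api_version(630)
db = fdb.open()

# Hypothetical keys for this sketch: migrated configuration rows plus one checkpoint marker.
migrated = fdb.Subspace(('config',))
checkpoint_key = fdb.Subspace(('migration',)).pack(('last_row_id',))

@fdb.transactional
def migrate_batch(tr, rows):
    # A batch of source rows and the updated checkpoint commit in one transaction,
    # so a crash either keeps the whole batch or none of it, never half a batch.
    for row_id, payload in rows:        # payload is already serialized to bytes
        tr[migrated.pack((row_id,))] = payload
    tr[checkpoint_key] = fdb.tuple.pack((rows[-1][0],))

@fdb.transactional
def last_checkpoint(tr):
    v = tr[checkpoint_key]
    return fdb.tuple.unpack(v)[0] if v.present() else 0

def run_migration(fetch_rows_after):
    # fetch_rows_after(row_id, limit) stands in for reading the next PostgreSQL batch,
    # ordered by primary key. On restart we simply resume from the stored checkpoint.
    cursor = last_checkpoint(db)
    while True:
        rows = fetch_rows_after(cursor, 1000)
        if not rows:
            break
        migrate_batch(db, rows)
        cursor = rows[-1][0]
```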
Closing with a quick look ahead at what FDB enables
To summarize: the new store (FoundationDB) and the updated vRNI data model eliminate a large class of performance and resource bottlenecks. They provide much better performance per node and the ability to scale to very large cluster sizes for monitoring super-large datacenters.
This step also clears the path for advanced enterprise capabilities built natively into vRNI, without depending on external software: HA and cross-availability-zone DR with zero data loss at the time of a failover.
The updated data model provides tighter and more efficient APIs to the layers inside vRNI for building more advanced use cases as we go forward.
Having a versatile system like FoundationDB at the bottom of the stack opens up a lot of future possibilities: an efficient metric time-series store (like Wavefront), or even an integrated blob store. The primary motivation for such future directions is to have a common multi-model store powering many different services. This model, in theory, simplifies and unifies a large number of the common concerns associated with any enterprise system: backup, HA, DR, monitoring, and other operational aspects.