The Data Locality Discussion (revisited Part 1)

Back in 2014, Rawlinson made the case for how Virtual SAN takes advantage of data locality, both temporal and spatial. He explains how intelligent prefetch algorithms save disk I/O and reduce latency. The Virtual SAN caching white paper goes into more detail on this.

A key point in this paper was that Virtual SAN does not migrate persistent data to the local flash. That style of data locality requires that behind every vMotion or HA event in your environment there be a massive flood of network traffic that can and will significantly impact latency. The supposed benefit is a reduction in storage read latency or network throughput, but Rawlinson clearly outlines in this paper that the latency of modern flash devices at queue depth is high enough to make the added network hop a moot point. Requiring customers to disable or turn down the Distributed Resource Scheduler (DRS) to improve storage performance shows the limitations of this design, and many competitors leveraging data locality are quietly telling customers to turn off DRS as they scale. Our own lab testing has revealed this to be a choice between poor consolidation ratios (DRS off) or highly inconsistent latency (leaving it on and accepting the consequences of data locality). Discussions with backup vendors point to data locality significantly impacting the performance of both backup and recovery at small and large scale; significant compromises are often required to keep full backups from crippling storage latency on these competing solutions.

Higher latency increases the processing time for workloads and negatively impacts the end-user experience. Inconsistent latency is even worse: users find seemingly random, sudden losses of application performance jarring and disorienting, and latency-sensitive applications may experience outages. It should also be noted that these migrations do not improve write latency, as data must always be written to a remote system.
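The hop-versus-flash-latency argument above can be sketched with rough arithmetic. The figures below are illustrative assumptions, not measurements from any vendor or paper:

```python
# Back-of-the-envelope comparison: does an extra network hop matter when
# flash latency at queue depth dominates? All numbers are assumptions
# chosen only to illustrate the shape of the argument.

flash_read_us = 500.0   # assumed flash read latency under queue depth (microseconds)
network_hop_us = 50.0   # assumed 10GbE hop plus network stack overhead (microseconds)

local_read_us = flash_read_us                    # data locality: no hop
remote_read_us = flash_read_us + network_hop_us  # distributed read: one hop

overhead_pct = 100.0 * network_hop_us / flash_read_us
print(f"local:  {local_read_us:.0f} us")
print(f"remote: {remote_read_us:.0f} us ({overhead_pct:.0f}% overhead)")
```

With these assumed numbers, the hop adds on the order of a tenth of the device latency, and the gap only narrows as queue depth pushes flash latency higher.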

The argument for optimizing network throughput has changed quite a bit as network capabilities and costs have evolved. A decade ago this was a real concern, as 1Gbps networking was the norm and 10Gbps ports were expensive. Today, with onboard 10Gbps ports, 10GBase-T networking, and switch ports below $200, that is no longer the case. Using the vSphere Distributed Switch (vDS), Virtual SAN can intelligently share ports with other traffic using Network IO Control (NIOC), and can leverage multiple links using the link aggregation technologies available with modern switching.
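The share-based division NIOC performs under contention can be illustrated with a minimal sketch. The traffic names and share values here are hypothetical, and NIOC's actual policy engine (with reservations and limits on top of shares) is considerably more involved:

```python
def allocate_bandwidth(link_gbps, shares):
    """Split a contended link proportionally to per-traffic-type shares:
    the basic idea behind share-based schemes like NIOC, simplified."""
    total = sum(shares.values())
    return {name: link_gbps * s / total for name, s in shares.items()}

# Hypothetical shares on a 10Gbps port carrying vSAN, vMotion, and VM traffic.
split = allocate_bandwidth(10, {"vsan": 50, "vmotion": 25, "vm": 25})
print(split)  # vsan gets half the link, the other two a quarter each
```

Note that in NIOC, shares only matter while the link is actually contended; when a traffic type is idle, its bandwidth is available to the others, a detail this sketch ignores.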

With 25Gbps poised to become the onboard standard next year and 100Gbps multi-lane ports on the horizon, the immediate future does not suggest that improvements in read throughput are worth the cost in latency consistency. At the application layer, modern applications scale out horizontally rather than vertically, further diversifying the traffic profile for IO access. I will admit there are certain niche high-throughput applications, but these are generally fringe workloads that are primarily deployed on legacy modular storage for cost or highly asymmetric scaling reasons.

Fundamentally, one problem many storage platforms have is that they were designed 2-4 years before they ever publicly shipped, so many of the initial bottlenecks or design concerns are less relevant by the time they are released. Early movers often find themselves in a situation 4-5 years in where they realize their now nearly 10-year-old design has limitations that fail to address the modern needs of the market. How do you solve this problem?

  1. First, it requires the flexibility that comes with taking a longer-term view of the market rather than shortcuts to relevance. Companies that took design shortcuts to address the cost models or challenges of 7-10 years ago will struggle today to quickly change those “features,” which have become technical debt that must be paid.
  2. Second, it requires vision and constant re-review of the market and its direction, as well as an understanding of what customers are doing in their data centers (from business-critical applications, to end-user computing, to cloud native applications). Merely talking about customers deploying them is not enough. Coming up with custom solutions (like VMFork for Instant Clones, and Photon for container optimization) is key to staying relevant.
  3. Third, it requires hard work. A brilliant development staff has to extend the platform to address the coming challenges while still recognizing what customers need today. Thankfully, we have these resources committed to making Virtual SAN better.

In part two of this series we will review use cases and extensions where network latency does matter, and how Virtual SAN has been extended to handle those requirements.


3 comments have been added so far

  1. The network is the most critical component impacting performance and availability for any HCI solution (as FC is in the more traditional architecture), so having good networking is critical. You also should not have to modify your standard DRS practices to use any HCI solution; a good HCI solution will integrate just as well with things like DRS and VAAI as a more traditional environment.

    With a distributed block-based system, a portion of a VM's data will live on every node in a cluster, which reduces the impact of migrating VMs for maintenance or standard resource load balancing. Having intelligence about what data moves, and when, also helps ensure consistent performance during those operations, leaving the bulk of the network for the applications, which are the whole reason we have an infrastructure to begin with. If data were always remote, or never moved, the network would more easily become saturated, causing unpredictable latency and performance.

    The real benefits of having data local to the app come when you are running business-critical apps, and large ones at that, such as large databases, especially as the evolution of flash technology rapidly exceeds the capability of network throughput before customers have a chance to upgrade their systems. Unfortunately 100G Ethernet is still too slow for Optane (3D XPoint) and for systems that mix multiple NVMe drives with SSDs. The only place to get the benefit of that speed is local to the app, but you don't want to sacrifice mobility and flexibility, or saturate the network and cause inconsistent, unpredictable performance for the most critical applications, which are the ones driving this technology.

  2. I’m glad that we agree you shouldn’t have to disable DRS to use an HCI solution at scale. Sadly, I hear from customers that they are hearing otherwise from the support organizations of other HCI solutions.

    I see you’re focused on read throughput, but this ignores writes, which must be acknowledged remotely no matter what, and the latency issue. The original latency promises for 3D XPoint turned out to be overhyped (see link below). We MIGHT get there at some point, but optimizing your IO layout for something that’s a few years in the future implies a lack of flexibility in your underlying platform. VSAN has changed its entire data layout twice now as technology and the IO path have evolved (it’s an object system). As the speed of the underlying components changes, I’m sure there will be more changes.

    It’s looking like flash specifically isn’t getting orders of magnitude faster yet, just orders of magnitude cheaper per GB (a lot of people bet wrong on this and optimized their systems for hybrid rather than all-flash). With the rising hype around RoCE and NVMe over Fabrics, I’d argue storage networks are dropping in latency at a faster pace than flash, or even 3D XPoint specifically.

    Now, DRAM and NVDIMMs are interesting, but if you’re really craving that much throughput there are other arguments to be had (byte- vs. block-addressable, or just running the data set in memory with compression).

    This all loops back to the real elephant in the room, and the reason big enterprises used 3- or 4-controller systems back in the day for their mission-critical databases: if I can’t hit my performance SLA during maintenance operations, I don’t have a highly available system, and I’ve just added cost and complexity pointlessly. For these .1% workloads (I’m possibly being generous here), there are a number of solutions that cover this niche (100GB/s switched PCI-Express, or systems with RAS feature sets that allow double execution and hot-swap CPU/DRAM). The future isn’t here quite yet, and as always it’s a bit cloudy which resource will truly become the real bottleneck for normal everyday workloads.

    Wrapping up: VSAN uses today’s technology to deliver consistent, performant results for non-unicorn workloads. It has evolved to handle new use cases and technologies, and will continue to.
