This blog was co-authored by Michael Haag

Besides fueling some of the fastest growth and changes in the IT industry over the past few years, Big Data and software-defined storage (like vSAN) have another less obvious similarity. Users of both share a fundamental desire to minimize—or often eliminate—the dependency on costly and complex external storage. Hadoop and vSAN both rely on industry-standard servers as the underlying hardware platform.


vSAN has been extremely successful in helping to grow the adoption of industry-leading hyper-converged infrastructure (HCI), with more than 7,000 customers in a very short period of time (~3 years). The growth stems from several key reasons that align with many of the desires of Big Data users:

  • The software-defined nature of vSAN eliminates the dependency on proprietary hardware, a fundamental characteristic of most Big Data environments.
  • vSAN can be deployed on x86 servers, and there are more than 175 vSAN Ready Nodes that customers can choose from, providing the choice and flexibility that Big Data users want as they look to manage and control their environment to meet their specific needs.
  • Customers can start small and scale their HCI environment powered by vSAN as their needs grow, which aligns with the flexibility and agility needed in most big data deployments.

These are some of the main reasons why enterprises are adopting vSAN for all of their key use cases: Tier-1 workloads, VDI, remote offices and branch offices, disaster recovery sites, and management clusters.


But another characteristic, and maybe the key one, that really opens the eyes of Big Data users is the inherent enterprise-class reliability, management, and security that vSAN brings. These vSAN features address a key operational gap in Big Data solutions, like HDFS, but at half the cost of enterprise storage. For now at least, we agree that the cost won’t be as low as bare metal; however, as businesses rely more on Big Data analysis to drive strategic projects, the manageability and protection of these environments become increasingly important.

Lately, many of our customers have asked us to provide a solution for Big Data workloads with vSAN. They would like to get the same benefits that they have seen with other use cases:

  • Standardize their whole infrastructure on one platform based on x86 servers, which helps minimize IT silos.
  • Reduce infrastructure management time and costs through unified and familiar tools.
  • Start with small POCs or projects, with the agility to easily expand (or repurpose infrastructure) as they go forward.


As a result of this natural alignment in the approach to infrastructure and the growing popularity of both solutions, we have been working with our strategic partner Intel to test, validate, and document running Big Data on top of all-flash vSAN. We feel the time is right to start considering all-flash vSAN for Big Data for several other key reasons:

  • The cost of flash is decreasing drastically. From a TCO perspective, it is more cost-effective to move everything to all-flash.
  • There has been a lot of innovation in flash technologies in terms of performance and capacity. We expect this trend to continue, which makes flash the dominant storage medium for enterprises.
  • Customers want to future-proof their infrastructure investments.


You can find the VMware and Intel white paper outlining how to deploy vSAN with Hadoop here:


We initially presented some of our findings and best practices at VMworld U.S., and this white paper provides all the details that we shared during that session. It also covers some future enhancements, offered as a technical preview, related to performance optimizations and to affinity/anti-affinity support for applications that have availability built in.

As the adoption of both Big Data and HCI continues at a rapid pace, we are continuing to innovate and optimize vSAN to align further with the goals and needs of our Big Data users.