Why the Data Scientist and Data Engineer Need to Understand Virtualization in the Cloud

More and more application workloads are moving to cloud platforms, whether public, private or hybrid. Big data and analytics workloads are on the move too. It is important that the data science and data engineering users of big data platforms and analytics applications gain a good understanding of the infrastructure in these clouds, so that they can make the best use of it and do their analytics work more effectively. Virtualization is at the core of all modern cloud environments; it forms the cloud infrastructure layer. The unit that provides the flexibility, elasticity, ease of management and scaling in any cloud is the virtual machine, principally through the hardware independence and portability that virtual machines offer.

Although non-virtualized servers may be supported in some clouds, in our experience it is very rare for a cloud deployment to use this native hardware approach, and it can become inefficient. No cloud service provider wants to be duty bound to acquire and provision new hardware servers whenever you want to expand your analytics processing cluster or other distributed application; that kind of setup can take considerable time. Virtualization answers this need through rapid provisioning of virtual machines for the purpose, given the hardware capacity to do so, of course. Multi-tenancy in the cloud is also achieved through virtualization: two tenants’ workloads may live on common servers, but they are separated from each other through their encapsulation in virtual machines.

For those who are new to the terms data science and data engineering, here is a very brief introduction. You can think of the data engineer as the person who integrates data from many sources, cleanses it, organizes it on a computing platform, provides indexes to it and generally sets it up in readiness for optimal querying and interrogation. The data scientists are those who then express the queries and analysis algorithms over that cleansed data, attempting to answer the business users’ questions. In some organizations the data science and data engineering roles are clearly delineated, while in others the same person may play both roles. Some organizations place these data-skilled personnel under an overseer called the Chief Data Officer, or CDO. Here is an outline picture of the landscape in which data scientists, data engineers and CDOs operate.


Until now, data scientists and data engineers have been accustomed to running their data processing and analysis work on bare metal, in a physical environment. With the recent rapid growth in cloud infrastructure, they need to understand the virtualized infrastructure within their clouds, as it now underlies and controls their workloads. We will go through the main points of interest in virtualization for the data scientist, and the benefits of using it, here.

The Industry Moves

Many data science and data engineering workloads today are based on the Hadoop and Spark platforms, with Python, Scala, R, Java or other languages as the programming environments that operate on them. In recent months we have seen strong growth in interest in deploying these platforms to all types of clouds. As one example, Databricks, the leader in the development of Spark, deploys its application platform to the public cloud first. The Spark technology is absolutely suitable for the private cloud too, as we will see in a later section. Leading Hadoop distributors such as Cloudera, Hortonworks and MapR have developed tools for deploying their distributions to public and private clouds. Pure-play analytics and machine learning vendors, such as H2O and Turi, also deploy their software as a service on the cloud. We see many smaller software companies deploying their big data products or infrastructure to some form of cloud from day one, no longer just to bare metal. There was an early misconception that virtualizing big data analytics workloads in this way would slow them down. However, extensive testing has shown that quite the opposite is true: performance is as good as bare metal for virtualized, private cloud-based big data workloads that use the underlying virtualization layer in the right way. We will show some testing results here to support that point. The result of all of this is that the pace of companies moving to the cloud is now picking up.

Sharing the Cloud Infrastructure Language

Business managers ask their data science teams to find the answers to key business questions. Data scientists depend on their data engineers to integrate, load, cleanse, index and manage the data so that it is suitably organized for their queries. These queries or jobs can concern fraud detection, customer pattern analysis, product feature use, sentiment analysis, product quality or many other business areas. Data science teams are made up of people with a variety of skills: analytics and statistical processing are prevalent among the data scientists, while data cleansing, data integration and SQL/programming ability are essential skills for the data engineers. Data scientists and engineers are often not involved in choosing or managing the infrastructure supporting their applications, though naturally they want the highest level of flexibility and the best performance from those applications. It is therefore very advantageous to the organization if the data science and data engineering people can speak the language of the cloud infrastructure decision makers, so that they can have a conversation about the best deployment choices. This can be as fundamental as how many virtual CPUs or how much memory a set of virtual machines should have for optimal behavior of a particular workload, or as broad as the architecture of the system as a whole.

Iteration on a Data Problem

Data scientists often iterate several times on the solution to a data analytics question. They refine their queries over time, getting different answers to a question, before they are happy with the results. They expand or contract the quantity of data used for queries and, along with that, the processing framework that holds the data, such as a Spark cluster. This is a dynamic environment, in which the amount of compute power and storage needed to support the analysis is unpredictable. Demand on the infrastructure can fluctuate widely over the course of a single project’s lifetime. This variability means that the application infrastructure must be open to expansion and contraction at will, according to user needs.

System Services

The types of software services that data scientists need vary too, requiring a lot of freedom of configuration for the end user community. One group may use dashboards, another workbooks/notebooks, others SQL engines for querying data, while others still write programs in Python and Scala to process the data. The toolkit for the data scientist and data engineer is growing continually, with new features appearing regularly. To do their work properly, data scientists need a scalable compute infrastructure and a high-performance data storage mechanism to support them. Their demands on the infrastructure will vary, but when they need it, performance is at a premium. The data science teams also need their supporting infrastructure to be available on demand; their time should not be wasted waiting for that infrastructure to be provisioned. A scientist may use the infrastructure heavily for a period of time and then move off the initial project to some other activity.

Trading off the compute requirements of separate data science teams, each measured differently, can be a significant task for the manager or the Chief Data Officer (CDO) who oversees these teams. This is the key area in which virtualization and cloud can help these communities. Managers such as the CDO, concerned with keeping their data analytics teams operating at maximum efficiency, do not want the infrastructure getting in the way. By carving their total set of computing resources into pools that can be allocated to teams flexibly, they can avoid the single-use purpose to which many physical clusters were initially put and use spare cycles from elsewhere.

Multiple Changing Factors in the Big Data World

In big data, one application infrastructure does not fit all processing needs. At the Hadoop level, for example, older distributed platforms may be suitable for some batch-type workloads, whereas Spark should be used for more interactive requirements or for iterating over a dataset. This means that there will almost certainly be more than one distribution of the platform (e.g. the Spark version), or combinations of other products, in use at any one time. At many of the enterprises we interact with, for example, we have found several versions, and two or more distributions, of this software in use at once. Virtualization provides the key to running all of these variants at once, with separation between them. Other variables are also at play:

  1. The types of questions being asked by the Chief Data Officer vary over time, requiring differing application types to support them
  2. The infrastructure components (such as the open source Hadoop and Spark distributions) are changing at a rapid pace
  3. Multiple versions of the software infrastructure are likely to be needed at the same time by different teams
  4. Separation of performance concerns across these teams is essential
  5. Data may be shared across multiple teams while the processing they do on that data may differ
  6. Certain instances of the infrastructure may be tuned for interactive response times while others are designed for batch processing

These variables all lead to a need for the type of flexibility that only virtualized platforms provide. Virtualization separates each group, version, distribution and application from the others, giving each its own sandbox or collection of virtual machines to work in, and isolating the performance of one collection from another.

Isolation and Security

Separating one team’s work from another’s is important so that each can achieve the SLAs the business needs. Separation in the virtualization sphere is done by grouping servers and virtual machines into their own confined resource pools, so as to limit their effect on external workloads. Example: different teams at a health care company use a Hortonworks cluster for certain analysis jobs, running alongside a MapR cluster on the same infrastructure for other analytic purposes. Distributed firewalls in purpose-built virtual machines can be deployed to exclude access to certain clusters and data from others that happen to be on the same infrastructure. Because these distributed firewalls are in virtual machines, they can move with the workload and are not confined to physical boundaries.


Performance

Data scientists are tasked with getting the answers to their queries within a time boundary. Performance of the systems that execute their work is of prime importance to this community. To test the performance of data science programs on virtual machines, VMware engineers executed a set of machine learning (ML) algorithms on Spark clusters, where each node in the cluster is held in a virtual machine. These tests were also executed on the same servers without virtualization present, and the results of the testing were compared. Here is one example of the results.



These types of systems begin by training a model on a large set of example data, so that the model will recognize certain patterns in the data. The example sets in the above setup ranged from 700 million to 2.1 billion entries. Each example had 40 features, that is, different aspects of one record of the data (such as the amount of the transaction and the expiry date for a credit card purchase). The model, once trained on a large number of examples, is then presented with a new transaction instance, for example a credit card transaction or a customer’s application for a new credit card. The model is asked to classify that particular transaction as a good one or a fraudulent one. Using statistical techniques, the model gives a binary answer to this question. This type of analysis is called binary classification. Several other kinds of machine learning algorithms were tested as well.
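To make the idea of binary classification concrete, here is a toy sketch in plain Python. This is not the Spark MLlib code used in the VMware tests; it is a minimal logistic-regression classifier trained by gradient descent on a handful of hypothetical two-feature transactions, just to illustrate the train-then-classify flow described above.

```python
import math

def train_logistic(examples, labels, epochs=200, lr=0.5):
    """Train a tiny logistic-regression model by stochastic gradient descent.
    examples: list of feature vectors; labels: 0 (good) / 1 (fraudulent)."""
    n = len(examples[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability of class 1
            err = p - y                          # gradient of the log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def classify(w, b, x):
    """Binary decision: 1 ('fraudulent') if probability > 0.5, else 0 ('good')."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) > 0.5 else 0

# Hypothetical transactions: [amount in thousands, hour of day / 24]
good = [[0.1, 0.5], [0.2, 0.4], [0.3, 0.6]]
fraud = [[5.0, 0.1], [7.5, 0.05], [6.2, 0.9]]
w, b = train_logistic(good + fraud, [0, 0, 0, 1, 1, 1])
print(classify(w, b, [0.15, 0.5]))  # -> 0 (classified as good)
print(classify(w, b, [6.0, 0.1]))   # -> 1 (classified as fraudulent)
```

The production-scale tests used 40 features and hundreds of millions of examples; the mechanics of training on labeled examples and then classifying a new instance are the same.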

What the results of these tests show is that these ML algorithms, operating in virtual machines on vSphere, perform as well as, and in some cases better than, the equivalent bare metal implementation of the same test code. A full technical description of these workloads and the results seen is given here 

Choices with Virtualization for Cloud Deployments

The technical choices that are open to the data scientist/data engineer and the infrastructure management team, as they begin their deployment of analytics infrastructure software in the cloud, are highlighted here:

  • Sizing the virtual machines correctly
  • Placement of the virtual machines onto the most appropriate host servers
  • Configuration of different software roles within the virtual machines
  • Assigning groups of computing resources to different teams
  • Rapid provisioning
  • Monitoring applications at the infrastructure level


Sizing Virtual Machines

You choose the amount of memory, number of virtual CPUs, storage layout and networking configuration for a virtual machine at creation time, and you can change them later if you need to. These are all software constructs rather than hardware ones, so they are more open to change. In big data practice so far, we have seen a minimum value of 4 vCPUs and a general rule-of-thumb value of 16 vCPUs for higher-end virtual machines running Hadoop/Spark type workloads. Applications that can make use of more than 16 vCPUs are rare but still feasible; those would be for code that is highly parallelized, with significant numbers of concurrent threads. The amount of memory configured on a virtual machine can be as large as the physical memory, although we see configured memory at 256GB or 512GB values for contemporary workloads. Basic operation of the classic Hadoop worker nodes can fit within a 128GB memory boundary for many workloads. These sizes will be determined by the application platform architects, but it is also valuable for the end users, the data scientists and data engineers, to understand the constraints that apply to any one virtual machine. Given the hardware capacity, these numbers can be expanded in short order to see whether the work completes faster under new conditions.
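The rules of thumb above can be turned into simple arithmetic. The sketch below, with illustrative host dimensions of our own choosing, computes how many worker virtual machines of the rule-of-thumb shape (16 vCPUs, 128GB) fit on one host without overcommitting either CPU or memory.

```python
def vms_per_host(host_cores, host_mem_gb, vcpus=16, mem_gb=128,
                 cpu_overcommit=1.0):
    """How many worker VMs of a given shape fit on one host.
    Defaults to the article's rule-of-thumb VM shape (16 vCPUs / 128 GB);
    cpu_overcommit=1.0 means no CPU overcommitment."""
    by_cpu = int(host_cores * cpu_overcommit) // vcpus
    by_mem = host_mem_gb // mem_gb
    return min(by_cpu, by_mem)   # the scarcer resource decides

# A hypothetical 2-socket host: 32 physical cores, 512 GB of RAM.
print(vms_per_host(32, 512))     # -> 2 (CPU-bound: 32 cores / 16 vCPUs)
print(vms_per_host(48, 768))     # -> 3
```

In this example the host's core count, not its memory, is the limiting factor, which is a common situation for the VM shapes discussed above.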

Placement of the Virtual Machines onto Servers

In some cloud environments, you get to make the choice about fitting virtual machines onto host servers and the racks that contain them; in others you do not. Where you are offered this choice, you can make decisions that will optimize your application’s performance. For example, in private clouds, you can choose not to overcommit the physical memory or physical CPU power of your servers with too many virtual machines or too large a configuration of virtual machine resources. Avoiding this over-commitment of physical resources generally helps the performance of resource-intensive applications such as data analysis ones. A corollary question is how many virtual machines there should be on any one physical host server. In our testing work with virtualization, two to four suitably sized virtual machines per server yield good results as a starting point, and this number can be increased as experience grows with the platform and the workload. Where the server hardware type and configuration are visible to you, NUMA boundaries should be respected in sizing your virtual machines, so as to get the absolute best performance from the machine. The virtualization best practices document mentioned above has detailed technical advice in this area.
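The NUMA advice above amounts to a simple check: a VM that fits entirely within one socket's cores and local memory avoids remote-memory access. This sketch, with hypothetical socket dimensions, expresses that check; real NUMA layouts have further subtleties (hyperthreading, memory reserved by the hypervisor) that it deliberately ignores.

```python
def fits_numa_node(vm_vcpus, vm_mem_gb, cores_per_socket, mem_per_socket_gb):
    """True if the VM fits inside one NUMA node, i.e. one socket's cores
    and locally attached memory, which is the recommended sizing target."""
    return vm_vcpus <= cores_per_socket and vm_mem_gb <= mem_per_socket_gb

# Hypothetical host: 2 sockets x 18 cores, 256 GB attached to each socket.
print(fits_numa_node(16, 128, cores_per_socket=18, mem_per_socket_gb=256))  # -> True
print(fits_numa_node(24, 384, cores_per_socket=18, mem_per_socket_gb=256))  # -> False
```

A VM that fails this check will have some of its vCPUs or memory placed on the remote socket, which is exactly the cross-NUMA traffic the best-practices guidance warns about.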

Different Software Roles within the Virtual Machines

In most contemporary distributed data platforms, the software process roles are separated into “master” and “worker” nodes. The controlling processes run on the former, and the executors that drive individual parts of the work run in the workers. You create a template virtual machine (a “golden master”) for each of these roles and then copy or clone that virtual machine as many times as you need instances of that role. There may be an option to install the analysis software components themselves into the template virtual machine, thereby avoiding a repeat of that installation when the clones or members of the cluster are instantiated. These nodes can be customized with new software services at any time in the lifetime of the cluster, through the vendors’ management tools. We can likewise decommission certain roles from the cluster in the same way, such as when we need fewer workers to carry out the analysis work.
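The template-and-clone pattern can be sketched as follows. The role names, VM shapes and service lists here are hypothetical, not a real vendor API; the point is that each node spec is stamped out from a role's golden-master template rather than configured by hand.

```python
import copy

# Hypothetical golden-master templates, one per software role.
TEMPLATES = {
    "master": {"vcpus": 8,  "mem_gb": 64,
               "services": ["resource-manager", "name-node"]},
    "worker": {"vcpus": 16, "mem_gb": 128,
               "services": ["node-manager", "data-node"]},
}

def clone_cluster(n_masters, n_workers):
    """Clone the role templates into a list of per-VM specs."""
    vms = []
    for role, count in (("master", n_masters), ("worker", n_workers)):
        for i in range(count):
            vm = copy.deepcopy(TEMPLATES[role])   # each clone gets its own copy
            vm["name"] = f"{role}-{i}"
            vms.append(vm)
    return vms

cluster = clone_cluster(n_masters=1, n_workers=4)
print(len(cluster))            # -> 5
print(cluster[1]["services"])  # -> ['node-manager', 'data-node']
```

Shrinking the cluster, as the paragraph above notes, is the reverse operation: drop worker entries from the list and decommission the corresponding virtual machines.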

Assigning Groups of Computing Resources to Different Teams

An inherent value of the virtualization approach for the cloud is the pooling of hardware resources, so that parts of the total aggregate CPU and memory power can be carved out and dedicated to specific purposes and teams. Virtualization includes the concept of a “resource pool”, with the capability to reserve a set amount of compute and memory power for this purpose (across several machines). One department’s set of virtual machines lives within the confined space of its resource pool while a separate department’s lives in another, even though the two may be adjacent to each other physically. When needed, more compute resources can be allocated to one or the other dynamically. This gives the manager of a data science department with competing requests from his or her teams much more control over the allocation of resources, depending on the needs and relative importance of each team’s project work.
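A minimal model of this carving-up, assuming made-up cluster and team names, is sketched below. Real resource pools (in vSphere, for instance) also support shares and limits; this sketch models only reservations and the dynamic reallocation of spare capacity described above.

```python
class ResourcePool:
    """A reserved slice of the cluster's aggregate CPU (GHz) and memory (GB)."""
    def __init__(self, name, cpu_ghz, mem_gb):
        self.name, self.cpu_ghz, self.mem_gb = name, cpu_ghz, mem_gb

class Cluster:
    """Aggregate capacity of a set of hosts, carved into per-team pools."""
    def __init__(self, cpu_ghz, mem_gb):
        self.cpu_ghz, self.mem_gb = cpu_ghz, mem_gb
        self.pools = {}

    def reserved(self):
        return (sum(p.cpu_ghz for p in self.pools.values()),
                sum(p.mem_gb for p in self.pools.values()))

    def carve(self, name, cpu_ghz, mem_gb):
        """Reserve a slice of total capacity for one team."""
        used_cpu, used_mem = self.reserved()
        if used_cpu + cpu_ghz > self.cpu_ghz or used_mem + mem_gb > self.mem_gb:
            raise ValueError("reservation exceeds cluster capacity")
        self.pools[name] = ResourcePool(name, cpu_ghz, mem_gb)

    def grow(self, name, cpu_ghz=0, mem_gb=0):
        """Dynamically move spare capacity into one team's pool."""
        used_cpu, used_mem = self.reserved()
        if used_cpu + cpu_ghz > self.cpu_ghz or used_mem + mem_gb > self.mem_gb:
            raise ValueError("no spare capacity")
        self.pools[name].cpu_ghz += cpu_ghz
        self.pools[name].mem_gb += mem_gb

# Hypothetical 400 GHz / 4 TB cluster shared by two teams.
cluster = Cluster(cpu_ghz=400, mem_gb=4096)
cluster.carve("fraud-team", 200, 2048)
cluster.carve("marketing-team", 100, 1024)
cluster.grow("fraud-team", cpu_ghz=100)   # spare 100 GHz goes to the busy team
```

The manager's lever is the `grow` step: when one team's project becomes the priority, its pool is expanded from unreserved capacity rather than by buying new servers.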

Rapid Provisioning

The creation of a set of virtual machines for a data scientist’s use is of course much more rapid than providing them with new physical machines, yielding better time to productive work. The further flexibility to expand and contract the population of virtual machines in that set is even more important. In larger deployments of analytic applications that we have seen, rapid expansion occurs in the initial stages, going from 20 nodes (virtual machines) to over 300 nodes in a single cluster. As time moves on and new types of workload appear, that large infrastructure may contract so as to fit a new one beside it on the same hardware collection, separated into a different resource pool. That type of flexibility is a function of virtualization. The strict ownership boundaries that once delineated private collections of servers for individual teams are broken down here.

Monitoring Applications at the Infrastructure Level

The data scientist or engineer wants to know why a query is taking so long to complete. We know this phenomenon well from the history of databases of various kinds. SQL engines are among the most popular layers on top of big data infrastructure today; indeed, almost every vendor has one. There is a constant need to find the inefficient or poorly structured query that is hogging the system. Virtual machines and resource pools provide isolation of that rogue query to begin with. Further, if the cloud provider allows it, being able to see the effects of your code at the infrastructure level provides valuable data for tracing its effects and optimizing its execution time.

Cloud Infrastructure is on the Move

In our opening remarks, we mentioned public, private and hybrid clouds. There is wide interest in using common infrastructure and management tools across these types of clouds. When these base elements are the same for private and public clouds, applications can live on either, as best fits the business need. In fact, an application can move from one to the other in an orderly fashion, without having to change. The lines will blur for the data scientist/engineer: a particular analysis project may start off as an experiment in a public cloud and, as it matures, migrate to a private cloud for optimization of its performance. The reverse move may also take place. Announcements were made in late 2016 that the full VMware Software Defined Data Center portfolio will execute in AWS as a service. This common infrastructure across private and public cloud extends the flexibility and choice for the data scientist and data engineer. Analysis workloads on VMware Cloud on AWS may now reach into S3 storage on AWS in a local fashion, within a common data center, thus bringing down the latency of access to the data.

In this article, we have seen the value to data scientists and data engineers of understanding the virtualization constructs as they deploy their analysis onto all kinds of cloud platforms. Getting the best out of a cloud deployment requires that understanding of the power of virtualization within the data preparation and analysis communities. Virtualization is a key enabling layer of software for these data workers to be aware of, and it helps them achieve the best results from their analytics work.
