
Best Practices for Managing Amazon Elastic MapReduce

As more organizations look to big data to help grow their business, they need to tackle data sets from public sources, eCommerce systems, social media, and other sources to glean information about their business. Apache Hadoop provides a cost-effective means to transform, cleanse, filter, analyze, and gain new value from all kinds of data, supporting workloads such as log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.

What is Amazon Elastic MapReduce?

Amazon Elastic MapReduce (Amazon EMR) is a Hadoop-based web service that makes it easy to process vast amounts of data quickly and cost-effectively. Amazon EMR uses Hadoop, an open source framework, to distribute data and processing across clusters of Amazon EC2 instances.

To run an EMR job, an EC2 cluster is instantiated (one or more EC2 instances in a master-slave relationship). Once initiated, the cluster is set up, transitions to Ready, executes a series of steps, and is then terminated. In the course of a month, thousands of EMR clusters may go through this life cycle. CloudHealth tracks when each cluster was initiated, ready, and terminated, as well as the final “State” of the cluster.

Without effective management, one could ask the eternal question: if an EMR cluster terminates abruptly in the cloud and no one is watching, does it make a blip on your radar? With potentially hundreds of clusters comprising tens of thousands of instances a week, the answer, unfortunately, is often no.

Let’s explore a few of the common challenges in EMR management and some solutions.

 

Challenges in EMR Management

Customers managing EMR environments face three common challenges: classification of EMR clusters, cost management for EMR, and monitoring EMR health. Let’s break each one down individually.

Classification of EMR Clusters

Classification is essential for proper EMR management. It is the only way to see the forest for the trees. Because the CloudHealth platform can classify related objects, you only need to ensure the EMR cluster itself is tagged. CloudHealth will identify the cluster instances and any other associated objects, such as volumes. We will also track Spot requests related to the launch of a cluster, as well as On-Demand, Spot, or Reserved Instance usage per cluster.

One critical thing to remember about EMR is that the EMR charge is only a fraction of the true cost of EMR. Instance usage, data transfer, and storage charges can be significant and are not included in the EMR charge. By classifying at the cluster level, we capture the direct EMR charges as well as the cost of downstream resources that are members of the cluster. Without this classification, you can’t properly assess EMR costs, and you can’t analyze the cluster as a whole.
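To make the distinction between the EMR charge and the full cluster cost concrete, here is a minimal Python sketch. It assumes billing line items have already been attributed to a cluster; the record layout, category names, cluster IDs, and dollar amounts are all hypothetical and are not CloudHealth's or AWS's actual data format.

```python
from collections import defaultdict

# Hypothetical line items, already attributed to a cluster by its tag.
# Category names and amounts are illustrative only.
LINE_ITEMS = [
    {"cluster": "j-EXAMPLE1", "category": "emr", "cost": 12.50},
    {"cluster": "j-EXAMPLE1", "category": "instances", "cost": 48.00},
    {"cluster": "j-EXAMPLE1", "category": "data_transfer", "cost": 61.25},
    {"cluster": "j-EXAMPLE1", "category": "storage", "cost": 3.25},
    {"cluster": "j-EXAMPLE2", "category": "emr", "cost": 5.00},
    {"cluster": "j-EXAMPLE2", "category": "instances", "cost": 15.00},
]


def full_cost_by_cluster(items):
    """Sum every cost category per cluster, not just the EMR charge."""
    totals = defaultdict(lambda: defaultdict(float))
    for item in items:
        totals[item["cluster"]][item["category"]] += item["cost"]
    return {cluster: dict(cats) for cluster, cats in totals.items()}


def emr_share(categories):
    """Fraction of a cluster's full cost that is the direct EMR charge."""
    total = sum(categories.values())
    return categories.get("emr", 0.0) / total if total else 0.0


costs = full_cost_by_cluster(LINE_ITEMS)
for cluster, categories in costs.items():
    print(cluster, sum(categories.values()), round(emr_share(categories), 2))
```

In this made-up data, the direct EMR charge is a small share of each cluster's total, which is the situation the paragraph above describes.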

Best Practices When Classifying EMR

EMR is essentially the analysis of data sets using clusters. This often shows up in cluster names, which tend to take the form [Verb]:[Noun]. It is useful to think of EMR in terms of jobs and payloads.

  • The job is the defined set of ingest, analyze, output steps that the cluster will perform.
  • The payload is the input data, which is processed by the job, producing a set of outputs.

In many ways, EMR is a classic case of X-Y analysis, where the X is the job, and the Y is the payload.

For classification of EMR, it is a best practice to tag clusters with both the analysis job and the set of data being analyzed. Once this classification is done, all tagged clusters can be gathered to form a history by job, by payload, or by a combination of the two.
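As a rough illustration of this tagging practice, the following Python sketch groups cluster records by job, by payload, or by both. The tag keys "job" and "payload", the cluster IDs, and the tag values are hypothetical; any consistent tag names would work the same way.

```python
from collections import defaultdict

# Hypothetical cluster records. The "job" and "payload" tag keys are
# illustrative, not a required naming scheme.
CLUSTERS = [
    {"id": "j-EXAMPLE1", "tags": {"job": "sessionize", "payload": "weblogs-a"}},
    {"id": "j-EXAMPLE2", "tags": {"job": "sessionize", "payload": "weblogs-b"}},
    {"id": "j-EXAMPLE3", "tags": {"job": "index", "payload": "weblogs-b"}},
    {"id": "j-EXAMPLE4", "tags": {"job": "index"}},  # payload tag missing
]


def group_by_tags(clusters, *keys):
    """Group cluster IDs by the given tag keys.

    Returns (groups, untagged): groups maps a tuple of tag values to the
    matching cluster IDs; untagged lists clusters missing any of the keys.
    """
    groups = defaultdict(list)
    untagged = []
    for cluster in clusters:
        tags = cluster["tags"]
        if all(key in tags for key in keys):
            groups[tuple(tags[key] for key in keys)].append(cluster["id"])
        else:
            untagged.append(cluster["id"])
    return dict(groups), untagged


by_job, _ = group_by_tags(CLUSTERS, "job")
by_job_and_payload, missing = group_by_tags(CLUSTERS, "job", "payload")
print(by_job)
print(by_job_and_payload)
print("missing a tag:", missing)
```

Returning the untagged clusters separately is a deliberate choice in this sketch: a cluster that lacks either tag drops out of the combined history, so it is worth surfacing rather than silently ignoring.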

Cost Management for EMR

Once the clusters are properly tagged and allocated, you can see the true costs of your EMR environment. Here is a sample of hourly charges showing the distribution of the full cost of EMR. In this case, the data transfer costs associated with the instances are the biggest cost, followed by the compute costs of the instances themselves. When you stand back and think about it, the transfer costs are for the huge data sets ingested by the cluster in operation. The compute cost is spread across potentially thousands of short-lived instances.

Cost Management EMR

Classification for Cost and ROI Analysis

As we’ve seen, getting to the true numerator in the cost/benefit analysis requires consideration of the full costs. Once that is done, you need to allocate these costs properly to different activities. This is where the flexibility of the CloudHealth cost reports comes in. In conjunction with Perspectives, you can easily report on the true costs of particular activities.

Cost reports can be used to classify EMR costs by job, by customer, and by job for a customer. Reports can be shown at day, week, or month granularity to help you accurately assess the cost per customer or data set over a given period of time. Because these reports contain all of the costs associated with a cluster, they go beyond the EMR charges to include instance hours and other related costs. Without this type of classification, it is very difficult to accurately assess the true costs of EMR.
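The kind of roll-up described here can be sketched in a few lines of Python. This is not how CloudHealth implements its reports; it is only an illustration that assumes per-day cost records which already include all cluster-related charges. The field names, customers, jobs, dates, and amounts are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily cost records per cluster run; "cost" is assumed to
# include instance, transfer, and storage charges, not only the EMR fee.
RECORDS = [
    {"day": date(2024, 1, 1), "customer": "acme", "job": "sessionize", "cost": 40.0},
    {"day": date(2024, 1, 3), "customer": "acme", "job": "sessionize", "cost": 10.0},
    {"day": date(2024, 1, 9), "customer": "acme", "job": "index", "cost": 25.0},
    {"day": date(2024, 1, 9), "customer": "globex", "job": "index", "cost": 5.0},
]


def period_key(day, granularity):
    """Label a date with its day, ISO week, or month."""
    if granularity == "day":
        return day.isoformat()
    if granularity == "week":
        year, week = day.isocalendar()[:2]
        return f"{year}-W{week:02d}"
    if granularity == "month":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unknown granularity: {granularity}")


def rollup(records, dimensions, granularity):
    """Total cost per (period, *dimensions), e.g. per week, customer, job."""
    totals = defaultdict(float)
    for record in records:
        key = (period_key(record["day"], granularity),)
        key += tuple(record[dim] for dim in dimensions)
        totals[key] += record["cost"]
    return dict(totals)


weekly = rollup(RECORDS, ("customer", "job"), "week")
monthly_by_job = rollup(RECORDS, ("job",), "month")
print(weekly)
print(monthly_by_job)
```

Changing the `dimensions` tuple switches between "by job", "by customer", and "by job for a customer" without touching the aggregation itself.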

Cost History EMR Filtered by Hours

This report shows the overall costs of various jobs over time across all clients. With CloudHealth’s extensive filtering capabilities you can quickly traverse from very broad to very specific reports and queries.

Cost History EMR Filtered by Jobs

Usage reports can tell you which instance types were used for a set of clusters, and their pattern over time. Below is a chart showing the instance hours per week, by instance type, for a particular job. You can also visualize this by reservation type, showing the distribution of On-Demand, Spot, and Partial Upfront usage.

EMR Usage

Monitoring EMR Health

Because EMR is by nature ephemeral, forensics are difficult. The CloudHealth EMR Cluster Hours report is an effective tool for assessing sets of clusters after they have run. Below is an example of a report on a set of EMR clusters run over time for a particular analysis job.

For each run we can see the setup time, the overall duration, the end state of the cluster, and the last State change message. These reports can be easily exported to Excel for further study or analysis.
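For readers who want to reproduce a similar per-run summary from their own records, here is a minimal Python sketch. It assumes each run records when the cluster was initiated, became ready, and terminated, plus its final state and last state change message. The field names, timestamps, and messages are hypothetical; the two state values follow EMR's terminal cluster states.

```python
from datetime import datetime

# Hypothetical run records for one analysis job. A cluster that failed
# during setup never became ready, so "ready" is None.
RUNS = [
    {
        "id": "j-EXAMPLE1",
        "initiated": datetime(2024, 1, 1, 8, 0),
        "ready": datetime(2024, 1, 1, 8, 6),
        "terminated": datetime(2024, 1, 1, 9, 30),
        "state": "TERMINATED",
        "message": "Steps completed",
    },
    {
        "id": "j-EXAMPLE2",
        "initiated": datetime(2024, 1, 1, 10, 0),
        "ready": None,
        "terminated": datetime(2024, 1, 1, 10, 4),
        "state": "TERMINATED_WITH_ERRORS",
        "message": "Bootstrap failure",
    },
]


def summarize(run):
    """Compute setup time and overall duration in minutes for one run."""
    setup = None
    if run["ready"] is not None:
        setup = (run["ready"] - run["initiated"]).total_seconds() / 60
    duration = (run["terminated"] - run["initiated"]).total_seconds() / 60
    return {
        "id": run["id"],
        "setup_minutes": setup,
        "duration_minutes": duration,
        "state": run["state"],
        "message": run["message"],
        # Anything other than a clean termination deserves a second look.
        "needs_review": run["state"] != "TERMINATED",
    }


report = [summarize(run) for run in RUNS]
for row in report:
    print(row)
```

Flagging every run that did not end in a clean termination is one simple way to make sure an abrupt failure shows up somewhere, even when nobody was watching the cluster at the time.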

EMR Cluster Hours Tabular

The Cluster Hours report can also be used to do a detailed breakdown of costs, as seen here.

EMR Cluster Hours with Associated Costs Tabular

Final Thoughts

EMR is a key AWS service that removes many of the barriers to big data analysis because it provides easy setup and a pay-as-you-go model. However, without proper tools in place, some companies find EMR expensive and challenging to manage. Using CloudHealth, you can easily classify and report on groups of clusters to see what they cost, how they were composed, how well they ran, and their detailed cluster-by-cluster costs, enabling you to take control of your EMR environment.

CloudHealth provides a substantial capability to classify, manage, and report on the usage, cost, and general health of your EMR environment. Using CloudHealth, you can easily classify your clusters by job type, customer, or data set. You can also cross-reference jobs and data sets to see the costs, instance patterns, and status of the EMR environment. You can do detailed cost breakdowns by individual cluster, by component, and for classifications of clusters over time. These costs can be trended over time to gain insight into the pattern and trend of EMR costs.

See it for yourself: request a demo of CloudHealth today.