Hadoop is used by some pretty amazing companies to make use of big, fast data, particularly unstructured data. Huge web brands like AOL, eBay, Facebook, Google, Last.fm, LinkedIn, MercadoLibre, Ning, Quantcast, Spotify, StumbleUpon, and Twitter use Hadoop, as do brick-and-mortar giants like GE, Walmart, Morgan Stanley, Sears, and Ford.
Why? In a nutshell, firms like McKinsey believe that big data and technologies like Hadoop will allow companies to compete and grow more effectively in the future.
Hadoop is used to support a variety of valuable business capabilities—analysis, search, machine learning, data aggregation, content generation, reporting, integration, and more. All types of industries use Hadoop—media and advertising, A/V processing, credit and fraud, security, geographic exploration, online travel, financial analysis, mobile phones, sensor networks, e-commerce, retail, energy discovery, video games, social media, and more.
At first glance, it sounds like many of these business needs were already solved by conventional data warehouses, business intelligence, and statistical analysis programs. They were not: conventional systems begin to fail when data sets grow too large, include fast-growing unstructured formats, or both. At that scale and complexity, traditional BI systems also become too expensive. This is why Hadoop was invented.
Simply put, Hadoop follows the MapReduce model: it slices data into chunks of work, spreads the work across a large number of commodity servers, and aggregates the results back into a single output. Its parallel computing approach out-scales the old models and is more cost-effective at doing so.
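To make that concrete, here is the canonical word-count job written against Hadoop's Java MapReduce API: the map step slices the input into (word, 1) pairs across many workers, and the reduce step aggregates them back into totals. This follows the stock Apache example; input and output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each mapper processes one chunk of input and emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the counts for each word are aggregated into one total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner runs the reducer locally on each mapper's output, shrinking the data shuffled across the network; that is the same divide-and-aggregate economics the rest of this post leans on.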
Effectively Managing Hadoop Through the CFO's Eyes
In the early days of the web and enterprise apps, everyone got so enamored with the potential for growth and productivity that both business and IT teams spent money prematurely. We ended up with a massive number of underutilized servers that cost an arm and a leg to operate. Then we spent more to virtualize those resources, get better utilization out of our data centers, and reduce our overhead.
With the big data technology trend, we are facing the same excitement around Hadoop. It is going to be an investment area for the next decade or two, and your CFO is going to see this coming. This time around, we can spend IT dollars much more wisely by putting Hadoop on virtualized infrastructure from the beginning. For those of us who have learned the painful TCO lessons of the past and understand the economics of virtualization, here is a list of ten key, financially sound cloud infrastructure requirements that should be part of any Hadoop project:
- Initial Hadoop projects should target the company's most pressing issues, starting with the CEO's and CFO's top needs and goals.
- Hadoop investments should run with the same data center efficiency and cost-effectiveness as other virtualized platforms, which achieve high server consolidation ratios and require less CapEx and OpEx than non-virtualized environments.
- Hadoop pilots should identify a big problem, keep the scope tight, and finish quickly to prove time-to-value and surface future costs and risks. We all learn by doing; don't drag out the time to value by over-engineering.
- Hadoop must be able to co-locate with existing applications and run on existing virtualized hosts. This approach lets a pilot proceed without new hardware and helps manage shared infrastructure budgets cost-effectively.
- Hadoop nodes should take advantage of time sharing. For example, when email, database, web, or ERP applications are idle, their spare compute power should shift to Hadoop nodes analyzing business performance.
- The Hadoop infrastructure should be able to scale up or down elastically, on demand, and across clouds for burst compute needs. This capability would let you expedite a big analysis of your company's performance by temporarily adding Hadoop nodes on a third-party cloud service to increase capacity; a sketch of such a burst-scaling policy follows this list.
- Hadoop VMs should not require significant resources to scale, provision, deploy, replicate, or move; a cloud-centric virtual machine infrastructure makes these operations cheap.
- Hadoop should be available to the company as a shared service, one of the most cost-effective delivery models. In this model, every department can use it on a chargeback accounting basis. Even with shared services, virtualization still provides enough isolation to meet independent business and security needs.
- Hadoop should not require expensive hardware-based high-availability or fault-tolerance (i.e., zero-downtime) frameworks. Distributed computing is meant for commodity hardware in the cloud.
- Hadoop training, at least at a high level, should be provided to every IT person who engages with business units and departments; Hadoop attracts talent and paves career paths.
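As a thought experiment for the elasticity requirement above, here is a hypothetical sketch of the decision loop a burst-scaling policy might follow. Everything here is invented for illustration: CloudProvider and ClusterMetrics are assumed interfaces, not a real VMware, Serengeti, or Hadoop API, and the thresholds are arbitrary.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical burst-scaling loop: rent nodes from a third-party cloud when
// the cluster runs hot, release them when it cools down. Not a real API.
public class BurstScaler {

  /** Invented interface standing in for whatever provisions worker VMs. */
  interface CloudProvider {
    String addWorkerNode();            // returns an ID for the new node
    void removeWorkerNode(String id);
  }

  /** Invented interface: cluster load as a 0.0-1.0 utilization figure. */
  interface ClusterMetrics {
    double utilization();
  }

  private static final double SCALE_UP_AT = 0.85;   // illustrative threshold
  private static final double SCALE_DOWN_AT = 0.30; // illustrative threshold

  // Track only the temporary burst nodes, so the baseline cluster is never shrunk.
  private final Deque<String> burstNodes = new ArrayDeque<>();

  /** One pass of the policy: burst out when hot, shrink back when idle. */
  void evaluate(ClusterMetrics metrics, CloudProvider thirdPartyCloud) {
    double load = metrics.utilization();
    if (load > SCALE_UP_AT) {
      // Temporarily rent extra capacity from the third-party cloud.
      burstNodes.push(thirdPartyCloud.addWorkerNode());
    } else if (load < SCALE_DOWN_AT && !burstNodes.isEmpty()) {
      // Release rented capacity first; permanent nodes stay untouched.
      thirdPartyCloud.removeWorkerNode(burstNodes.pop());
    }
  }
}
```

In practice such a policy would also have to respect HDFS data locality and replication before removing nodes; the point is only that elastic burst capacity reduces to a small, auditable control loop.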
To learn more about how VMware is helping virtualize Hadoop clusters, check out Project Serengeti.