7 Myths on Big Data—Avoiding Bad Hadoop and Cloud Analytics Decisions

Hadoop is an open source legend built by software heroes.

Yet legends are sometimes surrounded by myths, and those myths can lead IT executives down a path while wearing rose-colored glasses.

Data and data usage are growing at an alarming rate. Just look at the numbers from analysts: IDC predicts a 53.4% growth rate for storage this year, AT&T claims 20,000% growth in its wireless data traffic over the past 5 years, and if you take a look at your own communications channels, it's all but guaranteed that the internet content, emails, app notifications, social messages, and automated reports you get every day have dramatically increased. This is why companies ranging from McKinsey to Facebook to Walmart are doing something about big data.

Just like we saw in the dot-com boom of the 90s and the web 2.0 boom of the 2000s, the big data trend will also lead companies to make some really bad assumptions and decisions.

Hadoop is certainly one major area of investment for companies looking to solve big data needs. Companies like Facebook that have famously dealt well with large data volumes have publicly touted their successes with Hadoop, so it's natural that companies approaching big data first look to the successes of others. A really smart MIT computer science grad once told me, “when all you have is a hammer, everything looks like a nail.” This functional fixedness is the cognitive bias to avoid amid the hype surrounding Hadoop. Hadoop is a multi-dimensional solution that can be deployed and used in different ways. Let’s look at some of the most common preconceived notions about Hadoop and big data that companies should understand before committing to a Hadoop project:

1. Big Data is purely about volume—NOT TRUE

Besides volume, several industry leaders have also touted variety, variability, velocity, and value. Putting all arguments about alliteration aside, the point is that data is not just growing—it is moving further towards real-time analysis, coming from structured and unstructured sources, and being used to try and make better decisions. With these considerations, analyzing a large volume of data is not the only way to achieve value. For example, storing and analyzing terabytes of data over time might not add nearly as much value as analyzing 1 gigabyte of really important, impactful information in real time. From a tool-set perspective, you might want an in-memory data grid built for real-time pricing calculations instead of a way to slice and dice historical prices into a dead horse.
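To make the real-time point concrete, here is a minimal Python sketch of an in-memory rolling average that is updated on every incoming price tick, rather than recomputed from stored history. It is only an illustration of the idea, not how any particular data grid product works; the class name `RollingPrice` and the window size are assumptions made for this example.

```python
from collections import deque


class RollingPrice:
    """Hypothetical helper: keeps only the latest `window` prices in memory."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self._window = window
        self._prices = deque()
        self._total = 0.0

    def update(self, price):
        """Record a new price tick and return the current rolling average."""
        self._prices.append(price)
        self._total += price
        # Drop the oldest tick once the window is full, so memory stays bounded.
        if len(self._prices) > self._window:
            self._total -= self._prices.popleft()
        return self._total / len(self._prices)
```

Each call to `update` does a constant amount of work, so the answer is available as soon as a tick arrives. A batch job over years of stored prices answers a different question on a different timescale.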

2. Traditional SQL doesn’t work with Hadoop—NOT TRUE

When Facebook, Twitter, Yahoo! and others bet big on Hadoop, they also knew that HDFS and MapReduce were limited in their ability to deal with expressive queries through a language like SQL. This is how Hive, Pig, and Sqoop were ultimately hatched. Given that so much data on earth is managed through SQL, many companies and projects are offering ways to address the compatibility of Hadoop and SQL. Pivotal HD’s HAWQ is one example: a parallel, SQL-compliant query engine that has been shown to be 10 to 100s of times faster than other Hadoop query engines on the market today, and it was built to support petabyte data sets.
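The gap in expressiveness is easy to see in a small sketch. The Python below computes the same per-region total twice: once with hand-written map and reduce steps, and once with a single SQL `GROUP BY`. It uses Python's built-in `sqlite3` module purely as a stand-in SQL engine; it is not Hive, Pig, or HAWQ, and the sample data and function names are invented for this example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample records: (region, sale_amount)
SALES = [("east", 100), ("west", 250), ("east", 50), ("west", 25), ("north", 75)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def reduce_phase(pairs):
    """Group the emitted pairs by key and sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def totals_via_mapreduce(records):
    return reduce_phase(map_phase(records))


def totals_via_sql(records):
    """Same aggregation, expressed declaratively in one SQL statement."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
        rows = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        ).fetchall()
        return dict(rows)
    finally:
        conn.close()
```

Both functions return the same totals for `SALES`. The SQL version states what is wanted and leaves the execution plan to the engine, which is the appeal of putting a SQL layer over Hadoop data.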

3. Kill the Mainframe! Hadoop is the only new IT data platform—NOT TRUE

There are many longstanding investments in the IT portfolio, and the mainframe is an example of one that probably should evolve along with ERP, CRM, and SCM. While the mainframe isn’t being buried by companies, it definitely needs a new strategy to grow new legs and expand on the value of its existing investment. For many of our customers that run into issues with mainframe speed, scale, or cost, there are incremental ways to evolve the big iron data platform and actually get more use out of it. For example, in-memory, big data grids like vFabric SQLFire can be embedded or can use distributed caching approaches to deal with problems like high-speed ingest from queues, speeding up mainframe batch processes, or real-time analytical reporting.

4. Virtualized Hadoop takes a performance hit—NOT TRUE

Hadoop was originally designed to run on bare metal servers. However, as adoption has grown, many companies want it as a data center service running in the cloud. Why do companies want to virtualize Hadoop? First, consider the ability to manage infrastructure elastically. Scaling compute resources, like virtual Hadoop nodes, helps with performance when data and compute are separated. Otherwise, you would take a Hadoop node down and lose the data with it, or add a node and have no data on it. Major Hadoop distributions from MapR, Hortonworks, Cloudera, and Greenplum all support Project Serengeti and Hadoop Virtualization Extensions (HVE) for this reason. In addition, benchmarks we’ve done with partners have shown that Hadoop works quite well on vSphere and can even perform better under certain conditions: running 2 or 4 smaller VMs per physical machine often resulted in better performance, up to 14% faster, than a native approach.

5. Hadoop only works in your data center—NOT TRUE

First of all, there are SaaS-based cloud solutions, like Cetas, that allow you to run Hadoop, SQL, and real-time analytics in the cloud without investing the time and money it takes to build a large project inside your data center. For a public cloud runtime, Java developers can probably benefit from Spring Data for Apache Hadoop and the related examples on GitHub or the online video introduction.

6. Hadoop doesn’t make financial sense to virtualize—NOT TRUE

Hadoop is typically explained as running on a bank of commodity servers—so, one might conclude that adding a virtualization layer adds extra cost but no extra value. There is a flaw in this perspective—you are not considering the fact that data and data analysis are both dynamic. To become an organization that leverages the power of Hadoop to grow, innovate, and create efficiencies, you are going to vary the sources of data, the speed of analysis, and more. Virtualized infrastructure still reduces the physical hardware footprint to bring CAPEX in line with pure commodity hardware, and OPEX is reduced through automation and higher utilization of shared infrastructure.

7. Hadoop doesn’t work on SAN or NAS—NOT TRUE

Hadoop runs on local disks, but it can also run well in a shared SAN environment for small to medium-sized clusters, with different cost and performance characteristics. High-bandwidth networks like 10 Gb Ethernet, FCoE, and iSCSI can also support effective performance.

Taking Action to Overcome the Myths

While many of us are fans of big data, this list can help you take a step back and look objectively at the right approach to solving your big data problems. Just like some building projects need hammers and others need screwdrivers, hacksaws, or a welding torch, Hadoop is just one tool to help conquer big data problems. High velocity data may push you towards an in-memory, big data grid like GemFire or SQLFire. A need for massive, consumer-grade web scale may mean you need message-oriented middleware like RabbitMQ. Getting to market faster may mean you need to look at a full SaaS solution like Cetas, and Redis may meet your needs and find a home in your stack much more easily than a full-blown Hadoop environment.

Adam Bloom

About Adam Bloom

Adam Bloom has worked for 15+ years in the tech industry and has been a key contributor to the VMware vFabric Blog for the past year. He first started working on cloud-based apps in 1998 when he led the development and launch of WebMD 1.0’s B2C and B2B apps. He then spent several years in product marketing for a J2EE-based PaaS/SaaS start-up. Afterwards, he worked for Siebel as a consultant on large CRM engagements, then launched their online community and ran marketing operations. At Oracle, he led the worldwide implementation of Siebel CRM before spending some time at a YouTube competitor in Silicon Valley and working as a product marketer for Unica's SaaS-based marketing automation suite. He graduated from Georgia Tech with high honors and an undergraduate thesis in human-computer interaction.

13 thoughts on “7 Myths on Big Data—Avoiding Bad Hadoop and Cloud Analytics Decisions”

  5. Dezyre

    Excellent post, Adam! I love reading about some of these myths, as they are all so true. Number 6 seems to be the most constant, from my experience.

    Great post.
