Yet legends are often surrounded by myths, and these myths can lead IT executives down a path wearing rose-colored glasses.
Data and data usage are growing at an alarming rate. Just look at the numbers from analysts: IDC predicts a 53.4% growth rate for storage this year, AT&T claims 20,000% growth in wireless data traffic over the past 5 years, and if you look at your own communications channels, it's guaranteed that the internet content, emails, app notifications, social messages, and automated reports you get every day have dramatically increased. This is why companies ranging from McKinsey to Facebook to Walmart are doing something about big data.
Just like the dot-com boom of the '90s and the Web 2.0 boom of the 2000s, the big data trend will lead some companies to make some really bad assumptions and decisions.
Hadoop is certainly one major area of investment companies are using to address big data needs. Companies like Facebook that have famously dealt well with large data volumes have publicly touted their successes with Hadoop, so it's natural that companies approaching big data first look to the successes of others. A really smart MIT computer science grad once told me, “when all you have is a hammer, everything looks like a nail.” This functional fixedness is the cognitive bias to avoid amid the hype surrounding Hadoop. Hadoop is a multi-dimensional solution that can be deployed and used in different ways. Let’s look at some of the most common preconceived notions about Hadoop and big data that companies should understand before committing to a Hadoop project:
1. Big Data is purely about volume—NOT TRUE
Besides volume, several industry leaders have also touted variety, variability, velocity, and value. Putting all arguments about alliteration aside, the point is that data is not just growing: it is moving toward real-time analysis, coming from both structured and unstructured sources, and being used to try to make better decisions. With these considerations, analyzing a large volume of data is not the only way to achieve value. For example, storing and analyzing terabytes of data over time might not add nearly as much value as analyzing one gigabyte of really important, impactful information in real time. From a tool-set perspective, you might want an in-memory data grid built for real-time pricing calculations instead of yet another way to slice and dice historical prices that have already been analyzed to death.
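To make the volume-versus-velocity point concrete, here is a minimal Python sketch (purely illustrative, not tied to any product named in this article) of the kind of rolling, in-memory calculation a real-time pricing system needs: keep only a small window of recent ticks in memory and update the answer on every arrival, instead of re-scanning the full history.

```python
from collections import deque

class RollingPriceStats:
    """Keep a small in-memory window of recent prices for real-time
    decisions, rather than re-scanning the full history on every tick."""

    def __init__(self, window=5):
        self.window = deque(maxlen=window)  # only the most recent N ticks

    def observe(self, price):
        self.window.append(price)           # oldest tick falls off automatically
        return sum(self.window) / len(self.window)  # up-to-the-tick average

stats = RollingPriceStats(window=3)
latest_avg = None
for tick in [100.0, 101.0, 99.0, 104.0]:
    latest_avg = stats.observe(tick)
# After the last tick, only [101.0, 99.0, 104.0] remain in the window.
```

An in-memory data grid applies the same idea at scale, partitioned across many nodes, which is a very different workload than batch-scanning terabytes on disk.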
2. Traditional SQL doesn’t work with Hadoop—NOT TRUE
When Facebook, Twitter, Yahoo! and others bet big on Hadoop, they also knew that HDFS and MapReduce were limited in their ability to handle expressive queries through a language like SQL. This is how Hive, Pig, and Sqoop were ultimately hatched. Given that so much of the world's data is managed through SQL, many companies and projects are offering ways to make Hadoop and SQL work together. Pivotal HD’s HAWQ is one example: a parallel, SQL-compliant query engine that has been shown to be 10 to 100s of times faster than other Hadoop query engines on the market today, and it was built to support petabyte-scale data sets.
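To see why a SQL layer on top of Hadoop is so natural, note that a typical aggregate query maps cleanly onto map and reduce stages. Below is a toy, self-contained Python sketch of that translation (the table and query are invented for illustration; real engines like Hive do this compilation across a cluster):

```python
from itertools import groupby
from operator import itemgetter

# A SQL engine on Hadoop turns a query like
#   SELECT page, COUNT(*) FROM visits GROUP BY page;
# into map, shuffle, and reduce stages. Same shape, in miniature:

visits = ["/home", "/pricing", "/home", "/docs", "/home", "/pricing"]

# Map: emit a (key, 1) pair per row
mapped = [(page, 1) for page in visits]

# Shuffle: group pairs by key (the framework sorts between map and reduce)
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts for each key
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
```

The point is that "SQL doesn't work with Hadoop" confuses the storage and execution layers with the query language: the language can be compiled down to whatever the cluster runs.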
3. Kill the Mainframe! Hadoop is the new IT data platform—NOT TRUE
There are many longstanding investments in the IT portfolio, and the mainframe is one that should probably evolve alongside ERP, CRM, and SCM. While companies aren't burying the mainframe, it definitely needs a new strategy to grow new legs and expand on the value of its existing investment. For many of our customers that run into issues with mainframe speed, scale, or cost, there are incremental ways to evolve the big-iron data platform and actually get more use out of it. For example, in-memory big data grids like vFabric SQLFire can be embedded or used in distributed caching approaches to deal with problems like high-speed ingest from queues, speeding up mainframe batch processes, or real-time analytical reporting.
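The caching approach mentioned above is usually some variant of the cache-aside pattern: answer reads from a fast in-memory store and fall through to the system of record only on a miss. Here is a minimal, generic Python sketch (the `mainframe_lookup` function and a local dict stand in for a real mainframe transaction and a real distributed grid):

```python
class CacheAside:
    """Cache-aside: check the fast in-memory store first, and only hit
    the slow system of record (e.g., a mainframe) on a cache miss."""

    def __init__(self, backend_fetch):
        self._cache = {}              # stand-in for an in-memory data grid
        self._fetch = backend_fetch   # expensive call to the system of record
        self.backend_calls = 0        # track how much load we offload

    def get(self, key):
        if key not in self._cache:
            self.backend_calls += 1
            self._cache[key] = self._fetch(key)  # populate on miss
        return self._cache[key]

def mainframe_lookup(account_id):
    # Hypothetical stand-in for a slow, billable mainframe transaction.
    return {"account": account_id, "balance": 100}

cache = CacheAside(mainframe_lookup)
cache.get("A1")
cache.get("A1")   # served from memory, no backend call
cache.get("A2")
# Three reads, but only two trips to the system of record.
```

Each repeated read served from memory is a mainframe transaction (and often a chargeback) avoided, which is where the speed and cost wins come from.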
4. Virtualized Hadoop takes a performance hit—NOT TRUE
Hadoop was originally designed to run on bare-metal servers; however, as adoption has grown, many companies want it as a data center service running in the cloud. Why do companies want to virtualize Hadoop? First, consider the ability to manage infrastructure elastically: scaling compute resources, like virtual Hadoop nodes, helps with performance when data and compute are separated. Otherwise, taking a Hadoop node down would lose the data with it, and adding a node would leave it with no data. Major Hadoop distributions from MapR, Hortonworks, Cloudera, and Greenplum all support Project Serengeti and Hadoop Virtualization Extensions (HVE) for this reason. In addition, our research with partners has shown that Hadoop works quite well on vSphere and can even perform better under certain conditions: running 2 or 4 smaller VMs per physical machine often resulted in better performance, up to 14% faster than a native approach, according to benchmarks we've done with partners.
5. Hadoop only works in your data center—NOT TRUE
First of all, there are SaaS-based cloud solutions, like Cetas, that allow you to run Hadoop, SQL, and real-time analytics in the cloud without investing the time and money it takes to build a large project inside your data center. For a public cloud runtime, Java developers can benefit from Spring Data for Apache Hadoop and the related examples on GitHub or the online video introduction.
6. Hadoop doesn’t make financial sense to virtualize—NOT TRUE
Hadoop is typically described as running on a bank of commodity servers, so one might conclude that adding a virtualization layer adds extra cost without extra value. This perspective has a flaw: it ignores the fact that data and data analysis are both dynamic. To become an organization that leverages the power of Hadoop to grow, innovate, and create efficiencies, you are going to vary the sources of data, the speed of analysis, and more. Virtualized infrastructure still reduces the physical hardware footprint, bringing CAPEX in line with pure commodity hardware, while OPEX is reduced through automation and higher utilization of shared infrastructure.
7. Hadoop doesn’t work on SAN or NAS—NOT TRUE
Hadoop typically runs on local disks, but it can also run well in a shared SAN environment for small- to medium-sized clusters, with different cost and performance characteristics. High-bandwidth networks like 10 Gigabit Ethernet, FCoE, and iSCSI can also deliver effective performance.
Taking Action to Overcome the Myths
While many of us are fans of big data, this list can help you take a step back and look objectively at the right approach to solving your big data problems. Just as some building projects need hammers and others need screwdrivers, hacksaws, or a welding torch, Hadoop is just one tool for conquering big data problems. High-velocity data may push you toward an in-memory big data grid like GemFire or SQLFire. A need for massive, consumer-grade web scale may mean you need message-oriented middleware like RabbitMQ. Getting to market faster may mean you should look at a full SaaS solution like Cetas, and Redis may meet your needs and find a home in your stack far more easily than a full-blown Hadoop environment.
To learn more about the products in this article:
- Read over 100 articles about GemFire or SQLFire
- Check out the case studies on RabbitMQ
- See the Pivotal HD product page or the Hadoop Virtualization pages on VMware.com
- Learn more about Hadoop in the cloud with Cetas
- Find out more about Redis