Over the last few years we've seen a frenzy of interest and buzz around the area of Big Data. Beyond the hype, there is a solid base of growing use cases, which are taking center stage for most businesses. 2011 was the year of awareness. There was a great amount of sharing from the early core developers of the analytic platforms - showing the rest of the world the capabilities of the tools and platforms that had been developed for special-purpose, high-scale analytics. The big names at the core of open source analytics development include Facebook, eBay, LinkedIn and Twitter - all blazing the trail with new approaches. These companies brought along with them a new and expanding interest in leveraging the same technologies for commercial use.
In 2012, I saw much more activity within core enterprise and business. A growing number of enterprises are already heavily invested in the use cases - but by volume, most customers now have some form of big data proof of concept underway. These proofs of concept typically start with a thesis of how competitive advantage can be gained through insight from the data. A proof of concept can quickly validate the theory and help sell further investment in the analytics platform, and it snowballs from there.
VMware made awesome progress this year in making vSphere a great platform for big data, with the mission of allowing all varieties of big data storage and analytics frameworks to run on a common virtual infrastructure platform. In support of this, we've teamed up with the Hadoop community to validate virtual infrastructure as a differentiated Hadoop platform and make the combination of Hadoop and virtualization better than the sum of its parts. The highlights for this year include:
Now onto my predictions of how 2013 will unfold. Drumroll, please!
Prediction #5: We will all know at least one colleague who is bragging about a petabyte stockpile of new data.
We're seeing a growing list of new sources of data, most of it machine generated. It's estimated that in 2013, we'll produce 4 zettabytes (that's 4 million petabytes) of new data. Over 80% of that will be unstructured - in the form of files, documents, media, logs, and other types. That will amount to a jaw-dropping 1 quintillion new objects.
Current research shows a growth rate of between 50% and 60% per year for these new types of data. As an example, one customer I've been working with is building out an architecture to store every single key-click, mouse-over and application log event for every user for two years. This will give them tremendous insight into what their customers’ interests are, and allow them to do sophisticated targeted marketing. Keeping this data amounts to an estimated storage stockpile of 200 petabytes!
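To sanity-check the scale of those numbers, here is a minimal sketch of the unit arithmetic. It assumes decimal (SI) units, where a petabyte is 10^15 bytes and a zettabyte is 10^21 bytes, and it treats "over 80%" as exactly 80% for illustration. The variable names are my own.

```python
# Decimal (SI) storage units, assumed for this illustration.
PETABYTE = 10**15
ZETTABYTE = 10**21

# Estimated new data produced in 2013, as quoted above.
new_data_bytes = 4 * ZETTABYTE

# Express the same volume in petabytes.
new_data_pb = new_data_bytes // PETABYTE

# Treat "over 80% unstructured" as exactly 80% (a simplifying assumption).
unstructured_pb = new_data_bytes * 80 // 100 // PETABYTE

print(f"New data: {new_data_pb:,} PB")
print(f"Unstructured portion: {unstructured_pb:,} PB")
```

Under these assumptions, the unstructured portion alone comes to about 3.2 million petabytes.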
The economics alone is a forcing function towards new storage architectures. If we store 1 Petabyte today in a regular storage system, that's typically a storage investment of several million dollars. The challenges are the costs of storage, the administrative overhead of managing this much data, and bringing enough computation to the data in a way that we can reasonably filter, organize and analyze the data.
Prediction #4: ‘Delete’ will become a forbidden word
There's definitely a mindset change about keeping data - a shift from storing important data to keeping ALL data. The problem is that we don't know up front what questions we want to ask of the data, so if we don't keep that information we are precluded from doing whatever insightful analytics could have been the “killer use case”. If we keep all data, then we keep all options open for interesting analytics. The data scientists can develop new theories and models, and go back in time to test these new models against historical data.
I believe we'll see a growing number of companies follow the same path. They will set up sufficiently large-scale data stores and scale-out analytic tools so that keeping all data is affordable and practical.
Prediction #3: There will be a mad dash for software-defined storage
I predict we’ll see a flurry of new technologies and companies that will claim to offer different renditions of software-defined storage, aimed at storing this mass of data. The traditional model of whole-system storage hardware will change in light of the volume of data tilting heavily towards new data types, and a blurring of the line between compute and data.
Traditional data (customer records, transactions, history) just doesn't grow at anywhere near the rate of the new data. Traditional enterprise data is only growing at 20% or so per year - but as we saw, the amount of new data being stored is growing on the order of 50% year over year. This means that there will be two key shifts within the storage industry - a move towards more commodity-based storage that can potentially take the place of traditional storage, and a new set of high-scale storage architectures aimed at storing all this new data.
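To see how quickly those two growth rates diverge, here is a rough sketch of compound growth at 20% versus 50% per year. The equal starting volumes and the five-year horizon are hypothetical, chosen only to make the comparison easy to read; the function name `project` is my own.

```python
def project(start, annual_rate, years):
    """Return the volume at year 0..years under compound annual growth."""
    return [start * (1 + annual_rate) ** y for y in range(years + 1)]

# Hypothetical: both data types start at the same volume (1.0 unit).
traditional = project(1.0, 0.20, 5)  # ~20% per year
new_data = project(1.0, 0.50, 5)     # ~50% per year

for year, (t, n) in enumerate(zip(traditional, new_data)):
    share = n / (t + n)  # fraction of total volume that is new data
    print(f"year {year}: traditional {t:.2f}x, new {n:.2f}x, "
          f"new-data share {share:.0%}")
```

Even from an equal start, the new data types account for roughly three quarters of the total after five years under these assumptions, which is why the storage architecture has to be designed around them.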
The chase will come from multiple dimensions:
Prediction #2: The default infrastructure for Big Data will change
We should expect a tipping point in network infrastructure, with 10GbE networks and high-bandwidth switch topologies. Falling costs will allow the majority of new big-data installations to take advantage of 10GbE, resulting in a different set of assumptions about optimal big-data systems. Cross-sectional bandwidth of 1 Tbit/s within a rack will ease the focus on data locality, and put the emphasis more on designing storage topologies for availability. In 2013, data and compute can be anywhere in a switch domain with little or no performance difference. Beyond 2013 we'll see more interesting flat networks evolve, which will relax the locality requirement even further.
Additionally, the decreasing cost of flash and the increasing availability of software to take advantage of multiple tiers of storage will mean that flash will be an integral part of every storage architecture. Hot blocks will be placed automatically on SSD, and writes will be buffered by SSD to give much lower latencies. In some cases, entire application data sets will be moved to flash-based storage tiers.
Prediction #1: The focus on big data use cases will shift heavily towards real-time
Businesses are starting to realize they now have a significant and new competitive advantage with the ability to make real-time decisions based on their own data.
A few of the top use cases include:
As a result, in 2013 I predict we’ll see an emergence of the frameworks and technologies required to implement these systems. The significant components will include:
Almost every application being built to incorporate these techniques is hand-rolled. In 2013, we’ll see startups emerging with new PaaS-like frameworks to aid in the development of these real-time applications.
As the need shifts from a monolithic map-reduce powered platform to a hybrid of real-time, batch and machine learning, there will be a strong need for running multiple framework types on the same cluster. We believe that virtualization will play a central role in creating that common distributed platform, and we see a growing number of enterprises in 2013 standardizing on virtualization as the platform for their big-data solutions. I can’t wait to see how all this plays out next year!