Big Data allows you to find opportunities you didn’t know you had.
Fast Data allows you to respond to opportunities before they’re gone.
The combination of Big Data and Fast Data working together may enable new business models you never could have achieved before.
To elaborate, the idea is to analyze the historical "Big Data" and look for trends or patterns that have led to good results in the past. Then you model those patterns in such a way that you can detect them as they unfold in real time, based on incoming Fast Data. If we can analyze large amounts of historical and recent data quickly enough, we have an opportunity to influence the behavior of the actors in real time, and a better chance of steering them toward the patterns that produce good results.
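To make that loop concrete, here is a minimal Python sketch of "learn from the Big Data, detect on the Fast Data." Everything in it is an illustrative assumption rather than a description of any product: the event names, the `learn_patterns` and `StreamDetector` helpers, and the choice to treat a "pattern" as a run of three consecutive events that showed up in at least two sessions that ended well.

```python
from collections import Counter, deque


def learn_patterns(history, window=3, min_support=2):
    """Batch step: find event runs that recur in sessions with a good outcome."""
    counts = Counter()
    for events, good_outcome in history:
        if not good_outcome:
            continue
        # Slide a fixed-size window over the session and count each run.
        for i in range(len(events) - window + 1):
            counts[tuple(events[i:i + window])] += 1
    return {run for run, n in counts.items() if n >= min_support}


class StreamDetector:
    """Streaming step: keep only the last `window` events in memory."""

    def __init__(self, patterns, window=3):
        self.patterns = patterns
        self.recent = deque(maxlen=window)

    def observe(self, event):
        """Return True when the most recent events match a learned pattern."""
        self.recent.append(event)
        return tuple(self.recent) in self.patterns


# Hypothetical historical sessions: (events, did it end well?)
history = [
    (["browse", "compare", "add_to_cart", "buy"], True),
    (["browse", "compare", "add_to_cart", "buy"], True),
    (["browse", "leave"], False),
]
patterns = learn_patterns(history)

detector = StreamDetector(patterns)
live_events = ["search", "browse", "compare", "add_to_cart"]
hits = [detector.observe(event) for event in live_events]
```

The detector fires on the fourth live event, once the last three events line up with a run that led to a purchase in the past. That is the moment when you would want to act, for example by pushing an offer.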
Some Examples of Big, Fast Data
For instance, look at location-based services.
Mobile phone companies are looking at using Big Data to determine the most common journey patterns for a given individual. From those patterns they can then detect when the user deviates from their regular routine. For instance, a user traveling to a business meeting (rather than going to the office) might need navigational assistance.
There are other obvious location-based use cases. One is pushing offers when a user enters the local shopping district. Another, a very hot topic this week because of the 2012 Olympics, is that the organizers in London are detecting travel congestion for crowd management purposes.
Retailers can also use Big, Fast Data to identify opportunities to prevent customers from purchasing elsewhere. For instance, Amazon will pay shoppers $5 to walk out of stores empty-handed: it offers consumers up to $5 off purchases if they compare prices in a store using its Price Check mobile app. That gives the customer an incentive to stay loyal and buy from Amazon, and it provides Amazon with real-time market intelligence. Talk about capturing opportunities just before they disappear!
Government is Creating New Needs for Big, Fast Data
In the U.S. we passed the Dodd-Frank Act in 2010, and Europe is following suit with Basel III. These regulations aim to increase transparency in financial markets. The legislation puts a significant strain on financial institutions’ IT infrastructure, which has to cope with huge increases in real-time data that many institutions may not be equipped to handle.
The TABB Group predicts that “New data-recording requirements under Dodd-Frank may increase derivatives data levels by as much as 400%.” My friends tell me that TABB’s prediction is turning out to be way low. Together, the Dodd-Frank Act and Basel III requirements are tremendously increasing the amount of data that financial services companies are responsible for keeping. Both pre- and post-trade transaction information will need to be made available for contracts that today are traded over the counter. Adding to this data growth is the demand that banks record and report every transaction in near-real time, to provide a constantly updated picture of how exposed the institution is to risk at any given point in time. For complex OTC derivatives products, such risk calculations will drastically increase the compute and storage requirements for trading firms.
This explosion in data has made many of us start to think about the real value of the data we are storing. On the historical axis, we are now storing more data than ever. Going forward, we may be required to keep every version of every piece of data that ever went into any calculation that was delivered to a customer or a government body. At the same time, there is more and more real-time data being created every year. And there are new sources of data. For example, there are companies out there now doing sentiment analysis that includes Twitter sentiment along with our more traditional inputs.
What Approaches Could Work?
To my way of thinking, the trick is probably to find the sweet spot where you can make balanced tactical decisions on a subset of both the historical data and the real-time data. Some obvious use cases in the financial services world are: compliance, regulatory reporting, risk management, fraud detection, trade surveillance, abnormal trading pattern analysis and credit risk management.
How are we going to handle this growth in demand for data analysis? I think we’re going to have to focus on two key fundamentals.
1. Elasticity. We are going to need to let our data get really big, really quickly, and to be able to grow or shrink the various parts of our data management systems dynamically to adapt to and capitalize on peak periods of activity. This new wave of Big Data will probably mean ramping up much more often to run a new analysis, and then scaling back down again once the analysis is done.
2. A “tiered” approach to data storage. We are going to need to organize data, and how it is stored, to ensure we have fast access to the data we need. Here’s how VMware thinks we need to organize this data:
- Tier 1: We will need extremely fast transaction capture at the top level, like an eXtreme Transaction Processing (XTP) capability. This tier must be memory resident and horizontally scalable to deal with sizes in the hundreds of GB range. It will need to be “Active” and provide notification of interesting changes to the data so that business rules can be processed against the data in real time.
- Tier 2: The Online Analytical Processing (OLAP) tier will be fed by the XTP engine in batches, to facilitate any indexing or other processing needed for longer-term storage on fast disk-based storage. This is where the deep analysis will take place most often. The data that is historical but recent will be constantly analyzed using tools like MADlib, the R language, or even just complex SQL queries to find emerging opportunities in the recent data, allowing for changes in strategy or course corrections.
- Tier 3: Using the same kinds of storage as Tier 2, Tier 3 will be compressed to facilitate storage of very large data sets in the hundreds of TB range. This data will still be analyzed on a fairly frequent basis, so it will utilize concepts like Greenplum’s columnar storage mechanism to reduce the storage footprint while still keeping the queries for analysis purposes very fast.
- Tier 4: The final tier for online storage should be even further compressed, and on slower and less expensive storage devices so that you can handle data in PB scale when you need to do deep historical analysis. The idea behind this is that you never discard any data until you have exhausted all possibility of being able to extract new opportunities from it. Greenplum’s Hadoop module is a perfect place to store and analyze this fourth tier of data.
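One way to picture the four tiers is as a simple routing rule keyed on how old a record is. The Python sketch below is only an illustration under stated assumptions: the age cut-offs (one hour, 30 days, one year) are hypothetical numbers picked for the example, since the tiers above are defined by storage type and data volume rather than by fixed ages, and `Tier` and `tier_for` are made-up names, not part of any VMware or Greenplum API.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    max_age: Optional[timedelta]  # None means "no upper bound"
    storage: str


# Ordered from hottest to coldest. The age limits are assumptions.
TIERS = [
    Tier("Tier 1", timedelta(hours=1), "in-memory, horizontally scalable"),
    Tier("Tier 2", timedelta(days=30), "fast disk"),
    Tier("Tier 3", timedelta(days=365), "fast disk, compressed columnar"),
    Tier("Tier 4", None, "slower, cheaper storage, heavily compressed"),
]


def tier_for(age):
    """Return the first (hottest) tier whose age limit covers this record."""
    for tier in TIERS:
        if tier.max_age is None or age <= tier.max_age:
            return tier
    raise ValueError("TIERS must end with an unbounded tier")


placements = {
    "trade captured 5 minutes ago": tier_for(timedelta(minutes=5)).name,
    "last week's positions": tier_for(timedelta(days=7)).name,
    "data from 200 days ago": tier_for(timedelta(days=200)).name,
    "data from over 5 years ago": tier_for(timedelta(days=2000)).name,
}
```

Note that nothing is ever dropped: the last tier has no upper bound, which matches the idea of keeping data until you have exhausted every chance of extracting new opportunities from it.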
VMware’s core business is building the infrastructure to help companies scale their applications. As such, we’ve built out the virtualization infrastructure, as well as the vFabric product family of core components that help build those apps, including vFabric GemFire for Big Data and vFabric SQLFire for Fast Data.
On the “Fast Data” front, one of the things that we’ve seen with our SQLFire customers is that moving the compute right into the data fabric can yield as much as a 75X speed-up for simple operations like pricing and risk. It is highly likely that the same mechanism can also improve the time to detect patterns and anti-patterns in areas like fraud detection and compliance.
We have also noticed a trend for customers to use GemFire to perform “Map/Reduce” jobs on huge datasets. These scatter-gather semantics give you the ability to scale the analysis of Big Data in much the same way that compute grids gave you a way to scale structured mathematical problems. SQLFire is great for doing this on ‘now’ data, and we have seen lots of success working with Greenplum for the much larger historical data stored on disk.
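For readers who have not seen scatter-gather in action, here is a minimal Python sketch of the idea. It does not use the GemFire or SQLFire APIs. It stands in for a data fabric with a thread pool from the standard library, and the partitions, the trade fields (`desk`, `notional`) and the function names are all hypothetical. The point it tries to show is that the function travels to each partition and only a small summary travels back.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_exposure(partition):
    """'Map' step: runs next to one partition and returns a small summary."""
    totals = {}
    for trade in partition:
        desk = trade["desk"]
        totals[desk] = totals.get(desk, 0.0) + trade["notional"]
    return totals


def merge(partials):
    """'Reduce' step: combine the per-partition summaries into one result."""
    combined = {}
    for totals in partials:
        for desk, amount in totals.items():
            combined[desk] = combined.get(desk, 0.0) + amount
    return combined


def scatter_gather(partitions):
    """Scatter the function to every partition, then gather and merge."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return merge(pool.map(partial_exposure, partitions))


# Each inner list stands in for the data held by one node.
partitions = [
    [{"desk": "rates", "notional": 100.0}, {"desk": "fx", "notional": 50.0}],
    [{"desk": "rates", "notional": 25.0}],
    [{"desk": "fx", "notional": 10.0}, {"desk": "credit", "notional": 5.0}],
]
exposure = scatter_gather(partitions)
```

In a real data fabric each partition would live on a different machine, so the saving comes from shipping a few totals over the network instead of every trade.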
Other Customer Needs
One thing we hear a lot from customers is that they need access to the data everywhere. They aren’t necessarily going to copy a petabyte store to multiple locations around the world, but they do want it in at least two locations for business continuity. They also want local access to the results of the huge analyses, in the form of micro-cubes or partially aggregated results that can be stored in “edge caches” so that they can be sliced and diced “at memory speed” and locally (i.e. where the users are). GemFire WAN Gateways to regional edge caches are an excellent answer.
We are being told that the combination of eXtreme Transaction Processing (XTP) plus Big Data analytics is greater than the sum of its parts. This is a brand new opportunity for all of us. We have only begun to see the real-world benefits of this data strategy for customers, but it is looking extremely promising.
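As a rough illustration of what a micro-cube in an edge cache might look like, the Python sketch below pre-aggregates detail rows into (sum, count) cells keyed by two dimensions, then answers slice queries from those cells alone. The dimensions (`region`, `product`), the row layout and both function names are assumptions made for the example. A plain dictionary stands in for the edge cache, and no GemFire API is used.

```python
from collections import defaultdict


def build_micro_cube(rows):
    """Central step: roll detail rows up into (sum, count) per cell."""
    cells = defaultdict(lambda: [0.0, 0])
    for row in rows:
        cell = cells[(row["region"], row["product"])]
        cell[0] += row["amount"]
        cell[1] += 1
    return {key: tuple(cell) for key, cell in cells.items()}


def slice_cube(cube, region=None, product=None):
    """Local step: answer a slice from the cached cells, not the raw rows."""
    total, count = 0.0, 0
    for (cell_region, cell_product), (cell_sum, cell_count) in cube.items():
        if region is not None and cell_region != region:
            continue
        if product is not None and cell_product != product:
            continue
        total += cell_sum
        count += cell_count
    return total, count


# Hypothetical detail rows that would normally stay in the central store.
rows = [
    {"region": "EMEA", "product": "swap", "amount": 10.0},
    {"region": "EMEA", "product": "swap", "amount": 20.0},
    {"region": "EMEA", "product": "bond", "amount": 5.0},
    {"region": "APAC", "product": "swap", "amount": 7.0},
]
edge_cache = build_micro_cube(rows)

emea = slice_cube(edge_cache, region="EMEA")
swaps = slice_cube(edge_cache, product="swap")
everything = slice_cube(edge_cache)
```

Because sums and counts can be added together, any slice along these two dimensions can be answered from the small cached cube without going back to the petabyte store.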
>> Register for VMworld Session APP-CAP1250 - Fast Data Meets Big Data