Have you ever heard of a zettabyte? If you work in IT, you'll be hearing more and more about zettabytes, exabytes, and petabytes, while the data terms we now think of as big, such as terabytes and gigabytes, fade from our vocabulary. Right now, we are growing our data stores by 50% year-over-year, and it's only accelerating.
While data volumes are skyrocketing, the type of data is also becoming more difficult for traditional databases to handle. Over 80% of it will be unstructured, file-based data that does not fit the block-based storage typical of relational databases (RDBMS). So even if hardware innovations could keep up with the volume, the kinds of data we now store would still overwhelm traditional RDBMS at today's speeds.
The bottom line: the volume and types of data being stored are unrealistic for a single, monolithic, structured RDBMS data store. These stores need to be broken apart and re-architected to survive the Information Explosion we are experiencing today.
Why Traditional Databases Are Already Broken
To understand why traditional data stores will not scale in this explosive data era, it's useful to remind ourselves how they work. Essentially, an RDBMS stores structured data in neat rows and columns. To retrieve the data, we rely on Structured Query Language (SQL), which specifically describes the data required for the next transaction or report and explicitly names where it lives. While data could be federated across different application sub-systems, such as a customer database and a provisioning system, architects usually tried to keep data that would be used together in a single store to increase performance and availability for end users.
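To make that concrete, here is a minimal sketch of the classic model using Python's built-in sqlite3 module. The table, columns, and data are invented for illustration; the point is simply that the schema is fixed up front and the SQL query names both the structure and the location of the data it wants.

```python
import sqlite3

# A toy "customers" table: structured data in neat rows and columns.
# Schema and rows are illustrative assumptions, not from any real system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, name, region) VALUES (?, ?, ?)",
    [(1, "Acme Corp", "west"), (2, "Globex", "east"), (3, "Initech", "west")],
)

# The SQL query explicitly names the table (location) and columns (structure).
rows = conn.execute(
    "SELECT name FROM customers WHERE region = ? ORDER BY name", ("west",)
).fetchall()
print([name for (name,) in rows])  # → ['Acme Corp', 'Initech']
```

This works beautifully as long as the data actually fits a schema designed ahead of time.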
This data exists on disk, and works best when data is structured and easily categorized into those neat rows and columns. That world is gone.
The days when we had the time to architect perfect database schemas, so all data could be organized into those neat rows and columns, are past. In the new world, the volume, velocity, and unpredictability of data coming from modern sources overwhelm any pre-designed schema almost instantly. As uses of data change, there isn't time to design new structured schemas, upgrade data, or endure any downtime.
Additionally, with the volume of data we are collecting and using every day, we are hitting the upper limits of machine storage and the performance tradeoffs of monolithic databases. Even if database solutions could perform competitively at this scale, a single large data store would likely be cost prohibitive.
The NoSQL Revolution
While many companies are still learning about new storage approaches for big, fast data, IT companies and open source innovators have been working on several over the past few years. It started with NoSQL distributed database models. This strategy eliminates much of the structure, reducing queries to simple key-value lookups. It also supports partitioning data across several machines, as well as replication. These are all good things: they let us scale data horizontally across many commodity servers instead of a single expensive disk array, and they greatly speed up queries in general. But this solution still relies on data being stored on disk, and replication takes time to complete all the reads and writes. Because replication can lag, result sets can be out of date, so financial transactions and other workloads that need strong consistency are not good candidates. While it worked, performance was inconsistent, and it did not handle data replication across large distances well.
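The core NoSQL ideas above can be sketched in a few lines of Python. This is a toy model only: the four "nodes" are plain dicts, the hash routing and single-replica scheme are assumptions for illustration, and no real product works exactly this way. It shows schemaless key-value storage, hash partitioning across nodes, and replication to a second node.

```python
import hashlib

# Toy cluster: each "node" is just a dict standing in for a server.
NODES = [dict() for _ in range(4)]

def node_for(key: str) -> int:
    # Deterministic hash so every client routes a given key the same way.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)

def put(key: str, value) -> None:
    primary = node_for(key)
    NODES[primary][key] = value
    # Replicate to the next node. In a real system this write may lag,
    # which is exactly why reads can return stale results.
    NODES[(primary + 1) % len(NODES)][key] = value

def get(key: str):
    # No schema, no SQL: just a key lookup on the owning node.
    return NODES[node_for(key)].get(key)

put("user:42", {"name": "Ada"})
print(get("user:42"))  # → {'name': 'Ada'}
```

Because the value is an opaque blob to the store, any shape of data fits; the tradeoff, as noted above, is that nothing enforces consistency between the primary and its replica.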
In short, we could do better.
The Advent of NewSQL
Looking at the performance gains of distributed systems and the consistency limitations that come with simple data caching and a lack of schema, some companies embraced a middle ground of sorts. NewSQL, as the name suggests, keeps SQL as the way to access structured data, but also introduces distributed data structures that help with performance. Because NewSQL still uses SQL, designing database queries can be simpler, reducing complexity for developers to a degree, while still using distributed systems to improve performance. As with NoSQL, concepts like data partitioning and replication are brought to bear, but with data consistency maintained.
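A rough way to picture the NewSQL middle ground: SQL remains the query language, but rows are routed across independent databases by key. The sketch below shards a toy "orders" table across two SQLite databases; the two-shard layout and the modulo routing function are assumptions for illustration, not how any particular NewSQL product works.

```python
import sqlite3

# Two independent SQLite databases stand in for two database nodes.
SHARDS = [sqlite3.connect(":memory:") for _ in range(2)]
for db in SHARDS:
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")

def shard_for(order_id: int) -> sqlite3.Connection:
    # Route each row to a shard by its key; queries stay plain SQL.
    return SHARDS[order_id % len(SHARDS)]

def insert_order(order_id: int, total: float) -> None:
    shard_for(order_id).execute(
        "INSERT INTO orders VALUES (?, ?)", (order_id, total)
    )

def get_total(order_id: int) -> float:
    row = shard_for(order_id).execute(
        "SELECT total FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0]

insert_order(1, 19.99)
insert_order(2, 5.00)
print(get_total(1))  # → 19.99
```

Note how a query by key touches only one shard, which is great for transactions; an analytical query that scans all orders would have to fan out across every shard, which hints at why analytical workloads and linear scaling are harder, as the next paragraph discusses.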
This solution is a great compromise for many organizations. Structured schemas mixed with distributed data serve applications doing transaction-based processing very well, and at scale. Where it can fall short is analytical processing, because the schema is not necessarily set up for the desired queries, and the way these databases handle data distribution means they do not scale linearly.
In-Memory Data Stores Are Faster, More Consistent
We know that NoSQL data architectures worked from a few standpoints: they handled unstructured data well, and they partitioned data so we could scale horizontally on cheap commodity servers. But as humans, we always hunger to do better. We wanted this solution to be faster, more consistent, and to scale even better. The remaining challenges with NoSQL can largely be blamed on its use of disk. So why not eliminate it?
Over the past few years, memory has gotten cheap and is easily commoditized in the cloud, so moving your data strategy to put it all in-memory just plain makes sense. It eliminates the extra hop to read and write data from disk, making performance inherently faster and more consistent. It also simplifies the internal optimization algorithms and reduces the number of instructions sent to the CPU, making better use of the hardware.
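The "eliminate the disk hop" idea is easy to demonstrate with SQLite, which can back the exact same database with either a file on disk or memory. The workload below is invented for illustration, and absolute timings vary by machine, so none are asserted here; the point is only that the application code and the data are identical while the in-memory path skips disk I/O entirely.

```python
import os
import sqlite3
import tempfile

def run(conn: sqlite3.Connection) -> int:
    """Run the same toy workload against any connection and return row count."""
    conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")
    with conn:  # commit the batch insert
        conn.executemany(
            "INSERT INTO kv VALUES (?, ?)", ((i, "x" * 32) for i in range(5000))
        )
    return conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

# Same database engine, two storage media: a file on disk vs. pure memory.
path = os.path.join(tempfile.mkdtemp(), "kv.db")
on_disk = sqlite3.connect(path)
in_memory = sqlite3.connect(":memory:")

disk_count = run(on_disk)
mem_count = run(in_memory)
print(disk_count == mem_count)  # → True: same data, different storage medium
```

Swapping the connection string is all it takes; every read and write against the in-memory store stays in RAM, which is where the speed and consistency gains described above come from.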
It also supports both NoSQL and NewSQL style solutions, affording both of them better performance and allowing developers the freedom to choose the right modern solution for their data needs.
Breaking Databases Apart
Big, fast data is coming. To survive this data revolution, smart businesses need to ensure their data strategies are prepared to harness the power of their data and maintain complete histories of their data. They’ll need to partition their data across multiple data stores, and relearn how to architect, query and store data.
For more resources on how to employ these new data patterns and the VMware products that will help you in this new information age: