The new database is opening up significant career opportunities for data modelers, admins, architects, and data scientists. In parallel, it’s transforming how businesses use data. It’s also making the traditional RDBMS look like a T-REX.
Our web-centric, social media, and internet-of-things are acting as a sea-change to break traditional data design and management approaches. Data is coming in at increasing speeds, and 80% of it cannot be easily organized into the neat little rows and columns associated with the traditional RDBMS.
Additionally, executives are realizing the power of bigger and faster data—responding to customer demands in real-time. They want analysis, insights, and business answers in real-time. They want the analysis to be done on data that is integrated across systems. And, they don’t want to wait a day to load it into a data warehouse or data mart. As a result, developers are changing how they build applications. They are using different tools, different design patterns, and even different forms of SQL to parse data.
Understanding Data Scalability Strategies: Distributed, In-Memory, Replication and Partitioning
Data grids are a result of the school of thought for Almost-Infinite Scaling for Applications, made popular by Pat Helland. As Helland posits:
“…the number of customers, purchasable entities, orders, shipments, health-care-patients, taxpayers, bank accounts, and all other business concepts manipulated by the application grow significantly larger over time. Typically, the individual things do not get significantly larger; we simply get more and more of them. It really doesn’t matter what resource on the computer is saturated first, the increase in demand will drive us to spread what formerly ran on a small set of machines to run over a larger set of machines...”
Essentially, a big data grid helps applications deal with massive amounts of data by distributing the queries across many nodes working in parallel. To pull the data back together to be used by applications, these grids use data locators for data that is spread across multiple nodes, regions, and partitions. As another benefit, these nodes are typically smaller commodity servers, which are cheaper to purchase and replace, and since they are not specialized in any way can also be spread to the cloud easily for unlimited access to pay-as-you-go infrastructure.
In this post, we are going to dig into the architecture of a big data grid and how data can be colocated and queried across nodes for scalability and improved performance.
Big data grids also store data across many in-memory nodes—basically replacing RDBMS data stores on disk. By eliminating the read-write step from disk, data can be queried up to 10x more rapidly. There are still data grids that do not use in-memory data, but the performance improvements have certainly given birth to an entire new category of data grid solutions that has become very popular with application developers.
Replication is another tactic that can play a role in scalability by making the same data available on multiple nodes. This is commonly used to help applications that access the same data frequently to spread query processing across nodes with the same results. Of course, with replication, there can be delays in the data being universally updated, so this strategy is best used when individual transactions are not critical to the result set. For instance a customer order or payment status is likely unacceptable to have results out of sync, as your customers would become easily upset when their latest transaction is not present. However, if your application frequently fetches product availability, chances are a general result of availability is good enough until the actual order process is started. This work can be spread across several nodes to improve response times.
Understanding Data Partitioning and Colocation
Beyond these basic strategies, developers should look to partition aware database design strategies that can improve the performance of data queries. Modern data grid solutions, like SQLFire, support automatic data partitioning where developers can intentionally organize data so to optimize against their application’s data query patterns and avoid expensive locking or latching across multiple partitions during query execution. One strategy developers should consider is colocation—putting the data into the same memory space for reporting and analysis.
Similar to the world of web hosting, colocation means you put things together—in this case, data. With the right big data grid design, this forces data with common foreign keys to be on the same node in the grid. In other words, simply by having different types of data that is commonly used together on the same node, you can join across tables or databases on your big data grid using the same node’s cache, thus improving the speed of the query and optimizing the performance overall of the system.
For instance, if for billing purposes you regularly query customer contact information and order information together to form billing statements, you can set up your table structure by stating something like
'create table Orders (...) colocate with Customer' and keep all contact information and all orders for a single customer in a single member. Then, anytime you perform an operation on those data types a single customer uses the cache of only a single member.
This does involve using strategies to ensure that both of these types of data are spread across nodes in a manner where customer data sets are on the exact same node so that all types of queries can be performed at high volumes and speeds. Colocated joins will be on the same, cost-effective virtual machine memory space that can also be scaled up and down on-demand.
We typically see these types of big data grid requirements with financially-oriented systems, e-commerce, advertising, mobile, gaming, social, and other significant enterprise business problems that deal with massive data sets or fast moving data and a requirement for analysis to make business decisions.
The Colocation Scenario and Our Best Practice Guide’s Example SQL
Now, let’s look at the example of the SQL used in this type of colocation scenario—from Figure 11 of the SQLFire Best Practices Guide. In this example, we colocate
FLIGHTAVAILABILITY data with
FLIGHTS data so that a travel application can quickly present potential fliers with available flights that meet their travel needs.
|>> Note, the SQLFire Best Practices Guideis an excellent resource for anyone looking to understand how this “distributed, big data grid world” works and provides a lot of detail for implementing and managing SQLFire specifically along with tuning SQLFire JVMs.
In the SQL statement and diagram below, we have six nodes. Flights data is partitioned across three nodes and replicated (R) across another three nodes. Airlines data is available across all six nodes, and
FlightAvailability data is also partitioned across three nodes and replicated (R) on three nodes. With the example companies volume, scalability improvements are achieved by allowing the separate nodes to serve queries. In addition, the SQL shown includes a colocation statement—this forces the partitions and columns with the same values for
FLIGHT_ID to be colocated on the same node. Any rows that satisfy the range or list are colocated on the same member for all the colocated tables. So, a query for flights from NYC to LA and availability on those flights would be on data from the same node.
To learn more about vFabric SQLFire:
- Peruse over 59 articles in the SQLFire category on this blog—enough for data architects to prepare a business case or plan a big data grid architecture for virtually any type of system.
- Learn how vFabric SQLFire and GemFire can perform map-reduce style, distributed querying
- Read the background on why traditional databases are breaking.
- Understand SQLFire’s fit within the new vFabric Reference Architecture.
- See how SQLFire is blowing away RDBMS - Scaling and Performance for .NET and Java.
- Learn about game changing capabilities like groups, partitioning, and redundancy.