A great majority of data-intensive jobs run on databases powered by high-end servers (sometimes even mainframes). From a traditional data processing perspective, all the data is kept on a single database instance, or on multiple instances that all share the same data (shared-everything, with no data partitioning).
This approach remained valid for years, and the legacy systems are there to prove it. However, with the data explosion brought by the last decade, and especially by the last few years, many of those jobs no longer meet their goals. Due to the relational model, this huge growth in data usually brings exponential increases in the processing time of those jobs, especially the ones that have to iterate or search through all the available data. Jobs that handled around 1 million records a few years ago are now handling tens of millions of the same records, and processing times have jumped from minutes or a few hours to many hours, or even days.
This problem is especially recurrent in industries such as:
– Financial institutions – where risk management, trading positions, cash flow and other important jobs run overnight.
– Telcos and other service providers – running billing batch jobs.
– CRM systems in any industry – demanding consolidation of customer bases with external systems.
The cost of these delayed jobs varies widely, but it always has an impact most enterprises can't afford. It can range from delays in opening your bank branches in the morning to fitting fewer days into your billing cycle (and so hurting cash flow).
Since traditional databases only scale well vertically, the database vendors' answer is to scale the hardware vertically, moving the database to a larger server – which is, of course, extremely expensive. Because of that, I've seen customers buying machines with up to 2 terabytes of RAM and dozens of CPUs just to run their business-critical batches faster.
So, what's the alternative?
To handle big data processing we need another approach. The shared-everything model is clearly limited in scalability, and it needs distributed locks to keep the multiple servers synchronized on inserts and updates.
Instead, GemFire proposes a data partitioning strategy, where the data is divided across the different servers that compose a GemFire Distributed System. Of course, this strategy is combined with replication for maximum availability. The idea is that data records are divided between the GemFire servers in such a way that each server can run its data functions independently, based only on the data it currently hosts.
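As a sketch of how this looks in practice (the region name below is hypothetical), a partitioned region with a redundant copy of each data bucket can be declared in GemFire's cache.xml:

```xml
<!-- Hypothetical cache.xml fragment: the "customers" region is split
     into buckets across the members, and each bucket also keeps one
     redundant copy on another member for availability. -->
<region name="customers">
  <region-attributes data-policy="partition">
    <partition-attributes redundant-copies="1" />
  </region-attributes>
</region>
```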
As an example, take a batch job that iterates over 100 million records: we could partition this data across 10 nodes, and each one would handle only 10 million. Adding another 10 nodes would automatically re-partition the data so that each node handles 5 million records, and so on. In batch processing, the function can be distributed among all the members so that each server only processes the records it hosts. This way, we scale not only by having multiple processors working in parallel, but also by dividing the big data set into smaller chunks of records, so each server processes a much smaller set of data.
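To make the idea concrete, here is a minimal plain-Java sketch of hash-based routing. This illustrates the general partitioning principle, not GemFire's actual bucketing algorithm; the key format and counts are made up for the example.

```java
import java.util.Arrays;

public class PartitionSketch {

    // Route a record key to one of `nodes` servers by hashing the key.
    static int nodeFor(String key, int nodes) {
        return Math.floorMod(key.hashCode(), nodes);
    }

    // Count how many of `records` sample keys land on each node.
    static int[] distribute(int records, int nodes) {
        int[] perNode = new int[nodes];
        for (int i = 0; i < records; i++) {
            perNode[nodeFor("record-" + i, nodes)]++;
        }
        return perNode;
    }

    public static void main(String[] args) {
        // With 100,000 sample records on 10 nodes, each node hosts roughly
        // a tenth of the data; doubling the node count halves each share.
        System.out.println(Arrays.toString(distribute(100_000, 10)));
        System.out.println(Arrays.toString(distribute(100_000, 20)));
    }
}
```

Doubling the node count halves the share each server has to iterate over, which is exactly why adding members shortens the batch.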
This becomes even more interesting when we consider that each of those servers can be a standard commodity x86 machine; all of them together usually cost several times less than a single instance of the big machines used to host enterprise databases. This way we scale horizontally, saving costs and improving performance at the same time, while keeping linear scalability.
The proof of concept
The proof of concept was conducted with a potential customer – a big healthcare company from Latin America – that handles millions of customers and sees its data grow about 30% per year. Due to its business model, legal affairs and regulatory/compliance needs, it must be able to run a customer reconciliation process as frequently as possible in order to avoid fraud and unnecessary payments.
This process, which a few years ago took a few hours to complete, now takes 10 to 15 hours due to the huge increase in data in recent years. All possible tuning was done on both the RDBMS and the data structures, but the results didn't change much, and the customer is currently only able to run the process twice a month because of other batches that use the same database.
When this opportunity for the GemFire PoC came to us, the customer was planning to grow its database machine vertically once more, to a multi-million-dollar server – and, of course, to pay for many more licenses of its database management system, since it's licensed per CPU.
The goals of the GemFire proof of concept, over 5 working days, were:
– Migrate the current batch job from PL/SQL to Java, accessing the GemFire API.
– Show impressive performance gains on commodity x86 servers, running production data.
– Prove horizontal linear scalability, showing performance gains as more servers were added to the environment.
– Prove high availability of the solution.
The solution environment
The batch job, which had around 700 lines of core PL/SQL, was migrated to Java in a 16-hour work effort. Partitioning and replication strategies were defined to best leverage GemFire's benefits while keeping the solution highly available.
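As a rough sketch of the shape the migrated job takes, the snippet below shows a data-aware GemFire function. The class name and the reconciliation logic are hypothetical, and it needs the GemFire jars and a running cluster, so take it as an outline rather than the PoC code itself:

```java
// Hypothetical sketch of a data-aware function; requires the GemFire jars.
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.execute.FunctionAdapter;
import com.gemstone.gemfire.cache.execute.FunctionContext;
import com.gemstone.gemfire.cache.execute.RegionFunctionContext;
import com.gemstone.gemfire.cache.partition.PartitionRegionHelper;

public class ReconciliationFunction extends FunctionAdapter {

    @Override
    public void execute(FunctionContext context) {
        RegionFunctionContext rfc = (RegionFunctionContext) context;
        // Only the partition buckets hosted on this member: no network hops.
        Region<Object, Object> localData =
            PartitionRegionHelper.getLocalDataForContext(rfc);
        int processed = 0;
        for (Object record : localData.values()) {
            // ... reconciliation logic for one record would go here ...
            processed++;
        }
        // Each member reports its own partial result back to the caller.
        context.getResultSender().lastResult(processed);
    }

    @Override
    public String getId() {
        return ReconciliationFunction.class.getName();
    }
}
```

Such a function is typically invoked with FunctionService.onRegion(region).execute(new ReconciliationFunction()), which runs it in parallel on every member hosting a slice of the region.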
The solution was deployed on 16 servers, each with 2 vCPUs (Xeon 3.07 GHz) and 8 GB of RAM, allocated in a stepped approach to show scalability. As we were using production data, the results were compared with those from the production system to make sure there were no mistakes in the implementation.
The GemFire distributed function (the migrated batch) was executed against 6, 8, 12 and 16 servers of the configuration above, with the following results:
6 server nodes – 128 minutes.
8 server nodes – 61 minutes, or ~50% less time than with 6 nodes (after adding 2 nodes, or 33% more capacity).
12 server nodes – 29 minutes, or ~50% less time than with 8 nodes (after adding 4 nodes, or 50% more capacity).
16 server nodes – 19 minutes, or ~35% less time than with 12 nodes (after adding 4 nodes, or ~33% more capacity).
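As a quick sanity check on the percentages above (this is just arithmetic over the published runtimes, not part of the PoC code):

```java
public class ScalingCheck {

    // Percentage of runtime saved going from one configuration to the next.
    static double reductionPct(double minutesBefore, double minutesAfter) {
        return 100.0 * (minutesBefore - minutesAfter) / minutesBefore;
    }

    public static void main(String[] args) {
        // Runtimes measured in the PoC, in minutes.
        System.out.printf("6 -> 8 nodes:   %.1f%% less time%n", reductionPct(128, 61));
        System.out.printf("8 -> 12 nodes:  %.1f%% less time%n", reductionPct(61, 29));
        System.out.printf("12 -> 16 nodes: %.1f%% less time%n", reductionPct(29, 19));
    }
}
```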
Those impressive results showed both extremely high performance with linear horizontal scalability and high availability: each server kept copies of its data on other members of the cluster, and the many tests we conducted taking a few members down never affected the results or forced a rollback.
For the customer, this means those jobs can now be executed on a much less expensive environment and in a few minutes, which enables them to run more frequently and increase business income.
GemFire can be used as a Data Fabric and Data Grid solution to migrate big data processing functions (such as batch jobs running on databases or mainframe programs), greatly reducing processing time while saving resources on both hardware and database licenses.
The effort of modifying source code is usually extremely small compared to the money most companies lose to inefficient data handling processes, and it typically pays for itself within a few weeks of the project implementation.
As you might know, this is only one of the various use cases for the GemFire Data Fabric. The extreme low-latency and high-throughput cases were covered in previous posts.