
Author Archives: Fred Melo

How VMware vFabric GemFire reduced a data-intensive batch job’s processing time from 15 HOURS to 19 MINUTES

The problem:

    A great majority of data-intensive jobs run on databases powered by high-end servers (sometimes even mainframes). From a traditional data processing perspective, all the data is kept on a single database instance, or on multiple instances that all share the same data (share-everything, no data partitioning).

 

(Figure: the share-everything model)

 

 This approach was valid for many years, and the legacy systems are there to prove it. However, with the data explosion of the last decade, and especially of the last few years, many of those jobs no longer meet their goals. Because of the relational model, this huge growth in data usually brings exponential increases in the processing time of those jobs, especially the ones that have to iterate or search through all the available data. Jobs that handled around 1 million records a few years ago now handle tens of millions of the same records, and processing times have jumped from minutes or a few hours to many hours or even days.

    This problem is especially recurrent in industries such as:
    – Financial institutions, where risk management, trade positioning, cash flow and other important jobs run overnight.
    – Telcos and other service providers, running billing batch jobs.
    – CRM systems in any industry, requiring consolidation of customer bases with external systems.
 

    The cost of delaying those jobs varies widely, but it always has a significant impact that most enterprises cannot absorb. The impact ranges from delays in opening bank branches in the morning to being able to fit fewer days into a billing cycle (and therefore hurting cash flow).

     As traditional databases only scale well vertically, the database vendors' answer is to scale the hardware vertically, moving the database to a larger server – which is of course extremely expensive. Because of that, I have seen customers buy machines with up to 2 terabytes and dozens of CPUs just to run their business-critical batches faster.

 

So, what's the alternative?

    To handle big data processing we need another approach. The share-everything model is clearly limited in scalability, and it needs distributed locks to keep multiple servers synchronized on inserts and updates.

     Instead, GemFire proposes a data partitioning strategy, in which data is divided across the servers that compose a GemFire Distributed System. This strategy is of course combined with replication for maximum availability. The idea is that data records are split among the GemFire servers in such a way that each server can run its data functions independently, based only on the data it currently hosts.
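
     As a minimal sketch (using the GemFire Java API under the com.gemstone.gemfire packages; the region name, key and value types, and locator address are just illustrative), creating such a partitioned region with one redundant copy per bucket could look like this:

    import com.gemstone.gemfire.cache.Cache;
    import com.gemstone.gemfire.cache.CacheFactory;
    import com.gemstone.gemfire.cache.Region;
    import com.gemstone.gemfire.cache.RegionShortcut;

    public class PartitionedRegionSetup {
        public static void main(String[] args) {
            // Join the distributed system (the locator address is illustrative).
            Cache cache = new CacheFactory()
                    .set("locators", "locator-host[10334]")
                    .create();

            // "customers" is a hypothetical region; PARTITION_REDUNDANT spreads the
            // entries across all hosting members and keeps one redundant copy of
            // each bucket on another member for availability.
            Region<String, String> customers = cache
                    .<String, String>createRegionFactory(RegionShortcut.PARTITION_REDUNDANT)
                    .create("customers");

            customers.put("42", "some customer record");
        }
    }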

     As an example, if we take a batch job that iterates over 100 million records, we could partition that data across 10 nodes, so each one handles only 10 million. Adding another 10 nodes would automatically re-partition the data so that each node handles 5 million records, and so on. In batch processing, the function can be distributed among all the members so that each server processes only the records it hosts. This way, we scale not only by having multiple processors working in parallel, but also by dividing the big data set into smaller chunks of records, with each server processing a much smaller amount of data.
     This becomes even more interesting when we consider that each of those servers can be a standard commodity x86 machine, and together they usually cost several times less than a single instance of the big machines used to host enterprise databases. This way, we scale horizontally, saving costs and improving performance at the same time, while keeping scaling close to linear.
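
     A rough sketch of such a data-aware function, again assuming the GemFire Java API and purely hypothetical names (a "customers" partitioned region and a ReconciliationFunction), might look like this:

    import com.gemstone.gemfire.cache.Region;
    import com.gemstone.gemfire.cache.execute.FunctionAdapter;
    import com.gemstone.gemfire.cache.execute.FunctionContext;
    import com.gemstone.gemfire.cache.execute.RegionFunctionContext;
    import com.gemstone.gemfire.cache.partition.PartitionRegionHelper;

    // Runs on every member hosting the "customers" region; each member only
    // iterates over the entries it hosts locally.
    public class ReconciliationFunction extends FunctionAdapter {

        @Override
        public void execute(FunctionContext context) {
            RegionFunctionContext rfc = (RegionFunctionContext) context;
            // Local view of the partitioned data on this member only.
            Region<String, String> localData = PartitionRegionHelper.getLocalDataForContext(rfc);
            int processed = 0;
            for (String record : localData.values()) {
                // ... reconcile the record against the local batch logic ...
                processed++;
            }
            context.getResultSender().lastResult(processed);
        }

        @Override
        public String getId() {
            return "ReconciliationFunction";
        }
    }

    // Caller side: distribute the function to all members hosting the region.
    // Object results = FunctionService.onRegion(customers)
    //         .execute(new ReconciliationFunction())
    //         .getResult();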

The proof of concept

    The proof of concept was conducted with a potential customer – a big healthcare company from Latin America – that handles millions of customers and sees its data grow about 30% per year. Due to its business model and its legal, regulatory and compliance needs, it must be able to run a customer reconciliation process as frequently as possible in order to avoid fraud and unnecessary payments.

     This process, which a few years ago took a few hours to complete, now takes 10 to 15 hours due to the huge increase in data over recent years. All possible tuning was done on both the RDBMS and the data structures, but the results did not change much, and the customer is currently only able to run this process twice a month because other batches use the same database.

     When we came into this opportunity for the GemFire PoC, the customer was planning to grow its database machine vertically once more, to a multi-million-dollar server, and of course pay for many more licenses of its database management system, since it is licensed per CPU.

     The goals of the GemFire proof of concept, over 5 working days, were:
     – Migrate the current batch job from PL/SQL to Java, accessing the GemFire API.
     – Show impressive performance gains on commodity x86 servers, running production data.
     – Prove linear horizontal scalability, showing performance gains as more servers were added to the environment.
     – Prove high availability of the solution.

The solution environment

      The batch job, which had around 700 lines of PL/SQL code, was migrated to Java in a 16-hour effort. Partitioning and replication strategies were defined to best leverage GemFire's benefits while keeping the solution highly available.

       The solution was deployed on 16 servers, each with 2 vCPUs (Xeon 3.07 GHz) and 8 GB of RAM, allocated in a stepped approach to show scalability. As we were using production data, the results were compared with those from the production system to make sure there were no mistakes in the implementation.

 

Results

       The GemFire distributed function (the migrated batch) was executed against 6, 8, 12 and 16 servers of the configuration above, with the following results:

    – 6 server nodes: 128 minutes
    – 8 server nodes: 61 minutes, ~50% faster than with 6 nodes (2 nodes, or 33% more capacity, added)
    – 12 server nodes: 29 minutes, ~50% faster than with 8 nodes (4 nodes, or 50% more capacity, added)
    – 16 server nodes: 19 minutes, ~30% faster than with 12 nodes (4 nodes, or ~33% more capacity, added)
   
    Those impressive results demonstrated extremely high performance with linear horizontal scalability, as well as high availability: each server kept two copies of its data on other members of the cluster, and the many tests we ran taking a few members down did not affect the results or force a rollback in any case.
     For the customer, this means those jobs can now run in a much less expensive environment and in a few minutes, which enables them to run more frequently and increase business income.
   
Conclusion

    GemFire can be used as a Data Fabric and Data Grid solution to migrate big data processing functions (such as batch jobs running on a database or mainframe programs), greatly reducing processing time and saving resources on both hardware and database licenses.
    The effort of modifying source code is usually extremely small compared to the money most companies lose to inefficient data handling processes, and it pays for itself within the first few weeks after the project implementation.

   As you might know, this is only one of the many use cases for the GemFire Data Fabric. The extreme low-latency, high-throughput cases were covered in previous posts.

 

Using GemFire to Offload Data from Mainframes

There are many different reasons why customers want to migrate from the mainframe to a modern platform. Reducing cost is obviously one of the main ones, and it comes in several forms:

     - Reducing load (MIPS)
     - Reducing space, and sometimes power usage, in the data center
     - Increasing the consolidation ratio
     - Being able to run on commodity, much less expensive hardware
     - Using a more productive development environment
     - Employing less specialized developers, since mainframe developers become rarer (and therefore more expensive) every day

Other reasons may be related to time-to-market (how can I change my mainframe application overnight to comply with new market regulations, or to support the new product that will be launched next week?) or to eliminating vendor lock-in.

Traditional approach

Regardless, many customers are still very cautious about offloading from the mainframe, and this is understandable. These legacy applications still run the core business for many of them, and although they may cost a lot to maintain, they are usually very reliable. So the main strategy has been to write the new applications (for example, to serve new devices or products) in a modern environment and have those applications access the mainframe to read and write the core business data – which is still kept on the mainframe.

Although this strategy can sometimes speed up the development of new systems, the mainframe is still needed for the vast majority of data access, and MIPS usage is usually *not* reduced. In fact, it can increase as new users (and business transactions) arrive through new channels and devices, which in the end always access mainframe data.

This approach also creates other long-term problems. Data is segregated into two different models – the legacy mainframe model and the new modernized model – which coexist but diverge a little more every day. Complex (and costly) hooks must be written in the applications to convert back and forth between the new data model and the legacy mainframe model. Synchronization of the data is also a challenge and frequently causes issues, leading customers to lose confidence in the new platforms and to become even more wary of offloading from the trusted mainframe.

Offloading from the mainframe using GemFire

Based on this, a new strategy has been used by many customers worldwide to successfully migrate from the mainframe to a much more cost-effective modern platform that is still extremely reliable, with close-to-real-time performance, following an incremental, step-by-step approach. This not only allows those companies to modernize their development environment but, more importantly, greatly reduces their MIPS usage, allowing them to fully migrate off the mainframe once they are sufficiently comfortable with it.

Using the GemFire Data Fabric platform – based on an elastic, high-performance data grid model – customers can build a distributed, high-performance and horizontally scalable data access layer on top of their legacy platforms (e.g. the mainframe). Data can be loaded from the mainframe (or any other legacy platform) and written back to it as needed, while the transactions themselves run at microsecond latencies in the distributed memory of the GemFire data grid cluster members, which is far faster and more scalable than traditional transactions based on disk persistence. Replication between server peers during transactions is transparent, scalable and provides as much transaction consistency and durability as needed. Although data is written to disk asynchronously (so it does not depend on poor disk I/O throughput), it is replicated through the memory of the participating peers, so the chance of data loss is limited to a catastrophic failure (the loss of a complete data center) – that is, as small as with a traditional disk-persistent database or mainframe approach. For that scenario, GemFire also takes care of reliable replication over the WAN to an alternate data center (either a backup or an active-active site), guaranteeing geographical redundancy.
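
As a rough illustration of those durability settings (again assuming the GemFire Java API; the region and disk store names are hypothetical, and WAN gateway configuration is left out), a redundant, persistent region with asynchronous disk writes could be declared like this:

    import com.gemstone.gemfire.cache.Cache;
    import com.gemstone.gemfire.cache.CacheFactory;
    import com.gemstone.gemfire.cache.Region;
    import com.gemstone.gemfire.cache.RegionShortcut;

    public class DurableRegionSetup {
        public static void main(String[] args) {
            Cache cache = new CacheFactory()
                    .set("locators", "locator-host[10334]")   // illustrative locator address
                    .create();

            // Named disk store used for write-behind persistence on this member.
            cache.createDiskStoreFactory()
                    .create("batch-disk-store");

            // Each bucket keeps one redundant in-memory copy on another member and
            // is also persisted locally; disk writes are asynchronous, so memory
            // replication provides the short-term durability.
            Region<String, String> accounts = cache
                    .<String, String>createRegionFactory(RegionShortcut.PARTITION_REDUNDANT_PERSISTENT)
                    .setDiskStoreName("batch-disk-store")
                    .setDiskSynchronous(false)
                    .create("accounts");

            accounts.put("4711", "account snapshot");
        }
    }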

This way, the mainframe can still be used as the source of the legacy data and as archival data storage, but it no longer has to participate in every transaction (although it can for particular cases). This immediately speeds transactions up to memory and local data-center network rates and makes it possible to scale horizontally on demand, while reducing the load on the mainframe. GemFire guarantees that data is still written back to the mainframe asynchronously (usually in batches, on a sub-second basis) when needed, so no new challenge is created for the other legacy applications that still rely on the legacy data store.
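
One hedged sketch of such a write-behind hook, using the asynchronous event queue API available in more recent GemFire releases (the listener, the queue name and the mainframe call are hypothetical), could look like this:

    import java.util.List;

    import com.gemstone.gemfire.cache.asyncqueue.AsyncEvent;
    import com.gemstone.gemfire.cache.asyncqueue.AsyncEventListener;

    // Hypothetical write-behind listener: batches of changes are pushed to the
    // mainframe (e.g. via MQ or a CICS/DB2 adapter) without blocking GemFire puts.
    public class MainframeWriteBehindListener implements AsyncEventListener {

        @Override
        public boolean processEvents(List<AsyncEvent> events) {
            for (AsyncEvent event : events) {
                Object key = event.getKey();
                Object value = event.getDeserializedValue();
                // sendToMainframe(key, value);  // hypothetical integration call
            }
            return true;  // batch handled; GemFire can remove it from the queue
        }

        @Override
        public void close() {
            // release any mainframe connections here
        }
    }

    // Registering the queue and attaching it to a region (caller side):
    // cache.createAsyncEventQueueFactory()
    //      .setBatchSize(500)
    //      .setBatchTimeInterval(1000)   // flush at least once per second
    //      .create("mainframe-queue", new MainframeWriteBehindListener());
    // regionFactory.addAsyncEventQueueId("mainframe-queue");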

Consistency between the legacy data store and the Data Fabric is maintained through events that GemFire triggers on each data access. For example, data can also be written to the mainframe each time it is inserted or updated in GemFire, to keep the two in sync. It can likewise be loaded from the mainframe on a schedule, or whenever a value is not found in the Data Fabric. In the other direction, a change to data kept in the legacy data store can be sent to a queue or trigger a function in order to let GemFire know that a value has changed.
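
For the "load it from the mainframe when a value is not found" case, a minimal sketch of a read-through loader (the class and the mainframe lookup are hypothetical) could be:

    import com.gemstone.gemfire.cache.CacheLoader;
    import com.gemstone.gemfire.cache.CacheLoaderException;
    import com.gemstone.gemfire.cache.LoaderHelper;

    // Hypothetical read-through loader: when a key is not found in the Data
    // Fabric, GemFire calls this loader and caches whatever it returns.
    public class MainframeCacheLoader implements CacheLoader<String, String> {

        @Override
        public String load(LoaderHelper<String, String> helper) throws CacheLoaderException {
            String key = helper.getKey();
            // return fetchFromMainframe(key);  // hypothetical lookup against the legacy store
            return null;  // null means "not found" to the caller
        }

        @Override
        public void close() {
            // release any mainframe connections here
        }
    }

    // Attached to a region when it is created:
    // regionFactory.setCacheLoader(new MainframeCacheLoader());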

The same events can be used to notify client applications of simple changes to values stored in the Data Fabric, or of changes matching more complex criteria (that is, a continuously running server-side query).
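
A small sketch of such a continuous query, assuming a client-side QueryService with subscriptions enabled and a hypothetical /accounts region and OQL criteria, might be:

    import com.gemstone.gemfire.cache.query.CqAttributesFactory;
    import com.gemstone.gemfire.cache.query.CqEvent;
    import com.gemstone.gemfire.cache.query.CqQuery;
    import com.gemstone.gemfire.cache.query.QueryService;
    import com.gemstone.gemfire.cache.util.CqListenerAdapter;

    public class NegativeBalanceAlert {
        // The QueryService would come from a client cache / pool with
        // subscriptions enabled; the region and field names are hypothetical.
        public static void register(QueryService queryService) throws Exception {
            CqAttributesFactory attributes = new CqAttributesFactory();
            attributes.addCqListener(new CqListenerAdapter() {
                @Override
                public void onEvent(CqEvent event) {
                    // Fired whenever an entry starts (or stops) matching the criteria.
                    System.out.println("Matching change for key " + event.getKey());
                }
            });

            CqQuery cq = queryService.newCq(
                    "negativeBalances",
                    "SELECT * FROM /accounts a WHERE a.balance < 0",
                    attributes.create());
            cq.execute();  // start delivering events to the listener
        }
    }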

The combination of these data events – based on either simple or complex criteria, and able to trigger further events – can be seen as an embedded, data-friendly Complex Event Processing (CEP) platform, and can become an extremely valuable real-time, on-demand business data platform.

However, probably one of the most exciting characteristics of this approach is that the Data Fabric runs on commodity hardware and scales horizontally and almost linearly. This means customers can start with very small environments and add more servers when needed or desired, which immediately increases not only memory capacity but also processing power, since GemFire works as a grid computing platform, distributing the processing between peers (Read: Running jobs on mainframe <link to other article>). Most cases report improvements in transaction throughput of hundreds to thousands of times, and jobs that traditionally ran for hours are cut to a few minutes or even seconds.

Based on this, GemFire is suitable for basically two use cases when offloading from the mainframe:

     - Low-latency, high-volume data transactions

     - Long-running, data-intensive jobs, such as batches running overnight

 Wrap-up 

GemFire has been used with great success as the distributed data platform for the core business of large enterprises all over the world for the last decade, in demanding industries such as financial trading and telco prepaid real-time charging. It has a very important place in VMware's Cloud Application Platform offering, solving a number of challenges for data in the modern world, such as the classic horizontal scalability limits of relational databases, the disk I/O bottleneck, big data / data explosion, and scalable access to legacy systems. Return on investment for such projects has been as fast as a few months, based on large platform cost savings and the business advantage achieved.

Most customers start by using GemFire on top of their legacy platforms (e.g. traditional RDBMSs, mainframes, file-based persistence) to immediately gain a dramatic performance increase in their transactions. Over time, they gradually modernize their applications to access data directly from GemFire, and some of them eventually realize they no longer need their legacy platforms at all.