
Case Study: Analyzing Retail/e-Commerce Data 318x Faster on HAWQ + MADlib

In e-commerce or traditional store retail, information about purchase behavior is key to managing the profit and loss of the business. One of our customers saw how HAWQ could improve the speed of analysis by over 318x and reduce a set of analytic queries from 24 days to 6 minutes.

This type of performance gain allows business intelligence teams to increase their impact on operations by an order of magnitude.

Analysis, Statistics, and Business Intelligence in Retail

In retail, there is an endless list of questions about which items are purchased together, which promotions are working, where margins can improve, how much product should be kept in stock, and what drives customer loyalty. The answers to these questions help senior executives establish priorities and help store, region, or product line managers make advertising and promotions more profitable, increase basket size, improve profit margin, increase inventory turns, and improve similar metrics. Large teams of people, a complex array of systems, and big investments in analytical systems all go toward helping large retailers gain a competitive edge and improve operations. Even in mid-sized e-commerce companies with smaller teams and IT departments, there is a priority placed on doing more and better analysis to increase revenues.

One of our large retail customers wanted to improve the speed, cost, and scale at which they could analyze this type of data, for example by running association rules via MADlib with greater throughput on big data. With more capacity for analytical processes like market-basket analysis, lines of business could find a greater number of useful relationships, like “people who purchase charcoal and sockeye salmon also buy several other high-margin items like white wine, French gruyere, and artisan crackers.” The insight from this type of analysis leads to better advertising, merchandising, promotions, and margins, allowing the business to run more profitably.
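As an illustration of what such a market-basket run can look like, the sketch below calls MADlib’s association rules function from SQL. This is not the customer’s actual code: the transactions table and its trans_id and product columns are hypothetical, and the exact argument list can vary by MADlib version.

    -- Mine association rules with minimum support 0.25 and confidence 0.5.
    -- Each row of the (hypothetical) transactions table is one item in one basket.
    SELECT * FROM madlib.assoc_rules(
        0.25,            -- minimum support
        0.5,             -- minimum confidence
        'trans_id',      -- column identifying each transaction/basket
        'product',       -- column identifying each item
        'transactions',  -- input table of (trans_id, product) rows
        'retail_mart',   -- schema where the output rules table is written
        TRUE             -- verbose output
    );

    -- The discovered rules (with support, confidence, and lift) land in an
    -- output table in the chosen schema and can be filtered with ordinary SQL.
    SELECT pre, post, support, confidence, lift
    FROM retail_mart.assoc_rules
    ORDER BY lift DESC;

Because the rules are produced and stored in-database, analysts can rank or filter them by support, confidence, or lift without exporting data to a separate tool.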

Where HAWQ Fits in Retail or e-Commerce Environments

Our customer was using Hadoop as a central landing zone for 10 years of information, including both unstructured and structured data. In addition, they had a robust enterprise data warehouse (EDW) with 2 years of data. They used a large number of data marts to slice and dice analytics reports, explore scenarios, run market-basket analysis, and make better business decisions. The marts included data sourced from both the Hadoop Distributed File System (HDFS) and the data warehouse. In addition to the data marts, several data services and applications used data from both Hadoop and the EDW.

Their initial plan was to keep separate data warehouse and Hadoop environments that pushed data back and forth in real time over a high-speed interconnect. When they began looking at Pivotal HD and HAWQ to improve their data mining and business intelligence operations, they had several goals: 1) they wanted better performance than Hive, 2) they wanted to run SQL for business intelligence with high concurrency, and 3) they wanted a high-performance way to use MADlib’s association rules for data mining, statistical analysis, and machine learning.

How HAWQ Improved Performance and Business Results

In comparing Hive and HAWQ, their initial results showed key improvements. A test on one day’s worth of data took 1,593 seconds (about 26.5 minutes) in Hive versus 5 seconds in HAWQ, an improvement of roughly 318x. Another test saw 24 days (or 34,560 minutes) of queries boiled down to six minutes, a speed improvement of 5,760x. These types of improvements meant more data could be used for analysis, more analysis could be done, and a greater impact could be had on the metrics that matter to executives.

[Figure: Hive vs. HAWQ test results]

They also learned that HAWQ provides out-of-the-box support for data-parallel implementations of MADlib’s mathematical, statistical, and machine-learning methods. If you aren’t familiar with MADlib, it is an open-source, community-driven library of in-database statistical and analytical methods, implemented as SQL-based algorithms designed with parallelism in mind. While the project has since grown to version 1.0, MADlib was initiated in 2010 by Greenplum engineers and UC Berkeley researchers, and its early v0.3 release included the following methods (a usage sketch follows the list):

  • Supervised Learning: linear regression, logistic regression, naïve Bayes classification, decision trees, support vector machines
  • Unsupervised Learning: k-means clustering
  • Descriptive Statistics: Count-Min sketch, Flajolet-Martin sketch, data profiling, and quantiles
  • Support Modules: sparse vectors, array operations, and conjugate gradient optimization
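To give a feel for the in-database style these methods share, here is a minimal sketch of training one of them, linear regression, entirely in SQL. It assumes a MADlib 1.x-style linregr_train call; the houses table, its columns, and the output table name are made up for illustration.

    -- Fit price ~ tax + bath + size in-database; the fitted model is written
    -- to the houses_linregr output table rather than pulled to a client.
    SELECT madlib.linregr_train(
        'houses',                       -- source table (hypothetical)
        'houses_linregr',               -- output table for the fitted model
        'price',                        -- dependent variable
        'ARRAY[1, tax, bath, size]'     -- independent variables (1 = intercept)
    );

    -- Inspect the fitted coefficients and R-squared with ordinary SQL.
    SELECT coef, r2 FROM houses_linregr;

The training runs in parallel across the database segments, so the same call scales from a sample to the full dataset without changing the code.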

Beyond their original goals, their look into HAWQ also made them realize several disadvantages in their initial architecture plan. First, the separate Hadoop and data warehouse systems meant duplicate data, duplicate storage costs, and additional network overhead. Moving data also added cycle time, particularly with such large amounts of data. With HAWQ, they saw a way to simplify their architecture into a single system, because HAWQ allowed them to run SQL directly on data in Hadoop’s file system. Simplifying the architecture meant lower capital expenditures. It also meant more productive operations, because two separate systems require more resources and different skill sets to administer and operate than one.
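As a sketch of what running SQL directly against data in HDFS can look like, the example below defines a PXF external table over files already stored in Hadoop and then queries it like any other table. The host, port, path, column layout, and profile name are assumptions that depend on the specific cluster and HAWQ version.

    -- Expose CSV files sitting in HDFS as a queryable external table
    -- (hypothetical names; the PXF location syntax varies by release).
    CREATE EXTERNAL TABLE retail.sales_ext (
        trans_id  bigint,
        store_id  int,
        product   text,
        quantity  int,
        amount    numeric
    )
    LOCATION ('pxf://namenode:51200/retail/sales?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');

    -- Ordinary SQL over data that never left HDFS.
    SELECT store_id, sum(amount) AS revenue
    FROM retail.sales_ext
    GROUP BY store_id
    ORDER BY revenue DESC;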

In the HAWQ architecture, there was only one system to invest in, learn, and deploy. They didn’t have to keep developing with Hive, HBase, Pig, or MapReduce, an area that would otherwise have required several major development investments in the future. Instead, the entire team of DBAs, users, admins, and analysts could begin using existing SQL resources and existing SQL development experience. This also meant that much of the money spent and code written on past SQL data mining projects could be reused. Similarly, they could still utilize recent MapReduce investments and run MapReduce jobs where it made the most sense.

To learn more about Pivotal HD and HAWQ: