MADlib, the open source analytics library shepherded by Pivotal data scientists and UC Berkeley researchers, gets a fresh coat of paint with a major relaunch of the project’s website. MADlib lets practitioners perform Big Data analytics within SQL databases, offering a scalable library of algorithms that provides considerable speed and cost benefits to organizations.
MADlib is the result of conversations that began in 2009 between industry experts and academic researchers to develop new approaches to scalable, sophisticated in-database analytics. The “MAD” in MADlib refers to the library’s “magnetic”, “agile”, and “deep” environment for analysis, well-suited to a wide range of Big Data use cases across various industries. The open source library supports Postgres, Pivotal Greenplum Database, and Pivotal HAWQ, and is under ongoing development by Pivotal as well as researchers at UC Berkeley, Stanford University, and the University of Florida.
The key principles driving MADlib development are:
- Operate on the data locally, in the database. Do not move it between multiple runtime environments unnecessarily.
- Utilize best-of-breed database engines, but separate the machine learning logic from database-specific implementation details.
- Leverage MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
- Maintain an open implementation with active ties to ongoing academic research.
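The first principle, keeping computation next to the data, can be illustrated with a minimal sketch. It is not MADlib code and does not use MADlib's API: it assumes Python's built-in `sqlite3` module as a stand-in database, and the `sales` table, its columns, and the `fit_line_in_database` helper are hypothetical names made up for the illustration. The idea shown is that a simple linear fit can be computed from SQL aggregates, so only a handful of summary numbers leave the database rather than every row.

```python
import sqlite3


def fit_line_in_database(conn, table, x_col, y_col):
    """Fit y = slope * x + intercept using only SQL aggregates.

    Hypothetical helper for illustration. The database scans the rows;
    only five summary numbers are returned to the client.
    """
    # Identifiers are interpolated directly, which is acceptable only
    # for a sketch with trusted, hard-coded names.
    query = f"""
        SELECT COUNT(*),
               SUM({x_col}),
               SUM({y_col}),
               SUM({x_col} * {x_col}),
               SUM({x_col} * {y_col})
        FROM {table}
    """
    n, sx, sy, sxx, sxy = conn.execute(query).fetchone()
    # Closed-form ordinary least squares for a single predictor.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Stand-in data: an in-memory SQLite table with made-up values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1.0, 5.0), (2.0, 7.0), (3.0, 9.0), (4.0, 11.0)],
)

slope, intercept = fit_line_in_database(conn, "sales", "ad_spend", "revenue")
print(slope, intercept)
```

In a parallel database, aggregates of this kind can be computed per segment and then combined, which is roughly why the in-database approach pairs well with the shared-nothing principle in the list above.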
The library’s sophisticated algorithms meet many of the demands of a data-driven enterprise. In concert with Pivotal HD and HAWQ, MADlib provides “Deep Scalable Analytics,” with “data-parallel implementations of mathematical, statistical, and machine-learning methods for structured and unstructured data.” Its features include classification, regression, clustering, topic modeling, association rule mining, descriptive statistics, and validation.
Use cases for MADlib’s robust library of algorithms are varied and apply to the data science needs of many industries, including retail, advertising and public relations, financial services, media and telecommunications, manufacturing, energy, government, healthcare, and life sciences.
The speed and flexibility offered by MADlib, working in concert with HAWQ, are borne out by a recent case study by Adam Bloom, which demonstrated that this dynamic duo was able to “improve the speed of analysis by over 318x and reduce analytic queries from 24 days to 6 minutes” for one of Pivotal’s retail/e-commerce customers.
To learn more about MADlib and to download distributions of the library for Pivotal Greenplum Database, Linux, or OS X, visit the new MADlib site.