Big Data Performance and Best Practices: New Spark Application Measurements

The Performance Engineering team at VMware has published another useful report and blog post on its best-practice and performance work in the Big Data area. The new report contains results from benchmark tests run with both Spark-based and MapReduce applications, along with specific advice on best-practice implementations. Spark was originally created at the UC Berkeley AMPLab, was further developed as a major Apache open source project, and is now also marketed by a commercial company, Databricks. Spark is the latest programming framework in the Hadoop world.

Spark presents developers with a new API that makes applications easier to create than with the older MapReduce style. Spark proponents also claim that the new framework improves the speed of applications deployed on it. At the core of the new report, the performance engineers show that Spark-based applications perform as well on vSphere as they do on native hardware, and better when the appropriate best practices are applied. Those of you working in the Machine Learning (ML) space will be particularly interested in the results obtained with a set of open source ML libraries and toolkits, such as the Support Vector Machine and Logistic Regression test suites. Here is one graph from the report to whet your appetite for further reading.