Everyone seems to be talking about “Big Data” these days. We’re bombarded with information in online and print media about the explosion of machine generated data, the petabytes of data that companies like Facebook and Twitter generate, and the billions of dollars of opportunities awaiting all businesses through the use of big data. We also hear about what seems an alphabet soup of new technologies to process and analyze big data: Hadoop for distributed data processing, R for analytics, Lucene for text indexing and search, Mahout for machine learning…the list goes on and on.
If you’re a business user, you’re thinking that big data could give you an edge over your competition. If you’re a developer, you’re excited about the many new technologies you can learn about. If you’re an architect, you’re trying to figure out how all these big data technologies fit within your existing and future infrastructure.
Architecture Questions on Big Data and Big Analytics
As we are all asked to design approaches that balance technical merit against cost, the traditional questions come up, and a significant amount of internal and external research begins:
- What are the business goals and drivers?
- What architectural principles and governance will be followed?
- What is the scope?
- What do current and future state systems look like?
- What are the performance needs?
- Where are the greatest cost implications?
In the Big Data arena, key architecture fundamentals change, and we hope the framework below provides two things for you:
- A simple model to start thinking through cloud architectures for Big Data and Big Analytics at your company.
- A prompt to identify what you might learn from our “Big Data and Big Analytics” Panel at VMworld [link].
Beginning with a Simple Framework
At VMware, we’ve been using a simple framework to look at the key components of a Big Data system and to help our customers work through many architectural decisions as they explore the world of big data. Big data often brings four new and very different considerations to an enterprise architecture:
- Data sources have a different scale – this is the most obvious change; many companies now work in the multi-terabyte and even petabyte range.
- Speed is critical – nightly ETL (extract-transform-load) batches are insufficient, and real-time streaming from solutions like S4 and Storm is required.
- Storage models are changing – solutions like HDFS (Hadoop Distributed File System) and unstructured data stores like Amazon S3 provide new options.
- Multiple analytics paradigms and compute methods must be supported:
  - Real-time database and analytics: These are typically in-memory, scale-out engines that provide low-latency, cross-data-center access to data and enable distributed processing and event-generation capabilities.
  - Interactive analytics: This includes distributed MPP (massively parallel processing) data warehouses with embedded analytics, which enable business users to do interactive querying and visualization of big data.
  - Batch processing: Hadoop acts as a distributed processing engine that can analyze very large amounts of data, applying algorithms that range from the simple (e.g., aggregation) to the complex (e.g., machine learning).
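To make the batch-processing paradigm above concrete, here is a minimal sketch of the MapReduce model that Hadoop implements, written as plain single-process Python so it runs without a cluster. The stage names and the word-count aggregation are illustrative assumptions for this post, not the Hadoop API itself.

```python
from collections import defaultdict

# Toy, single-process sketch of the MapReduce model Hadoop implements.
# The map/shuffle/reduce stages mirror the paradigm; this is NOT the Hadoop API.

def map_phase(records):
    # Emit (key, value) pairs -- here, (word, 1) for a word-count aggregation.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Group values by key, as the framework would between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Aggregate each key's values -- the "simple" end of the algorithm spectrum.
    return {key: sum(values) for key, values in grouped.items()}

records = ["big data big analytics", "big data on hadoop"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
```

In a real Hadoop deployment, the map and reduce phases run in parallel across many nodes over data stored in HDFS, and the shuffle is handled by the framework; the local version simply makes the data flow visible.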
The diagram below illustrates this framework and shows that some components, or potentially the entire big data system, can run on a cloud infrastructure, which can make the system elastic, highly available, and multi-tenant. With resource sharing, we can ultimately bring the benefits of cloud computing to big data and keep budgets in check.
A Panel focused on Big Data Architecture Approaches
Since everything around big data is evolving very fast, there are many different perspectives on architecture, technologies, and products. We hope to bring several of these perspectives to you at VMworld San Francisco. I will be moderating a Big Data and Big Analytics panel discussion with an incredible group of industry visionaries and practitioners, who are ready to share their insights and respond to questions from the audience. Here are some examples of the types of questions the panelists will address:
- What are the best examples of big data applications that you’ve seen?
- What are the best practices for big data systems architecture?
- What is the role of virtualization and cloud computing in big data?
- How do you see the world of big data evolving over the next few years?
Collectively, this group has several decades of experience building, using and managing Big Data and Big Analytics projects. Let me introduce the panel participants:
- Amr Awadallah, CTO and Co-Founder at Cloudera, will talk about the impact Big Data and Hadoop are having on the industry, drawing upon his extensive experience at Cloudera as well as previously leading organizations like VivaSmart and Yahoo!.
- Zubin Dowlaty, VP and Head of Innovation & Development at Mu Sigma, will share customer scenarios and analytical approaches from his 18 years of experience applying quantitative methods to corporate data assets.
- Jim Kaskade, EIR at PARC (a Xerox company), will talk about the breakthrough ideas in Big Data and Customer Experience Management coming out of the research groups at PARC, where they use technologies like Hadoop, NoSQL databases, complex event processing, R, and many more.
- Richard McDougall, CTO of Application Infrastructure at VMware, will be ready to discuss architectural approaches for Big Data and Big Analytics solutions on a virtualized platform, drawing on his deep experience with scalability, availability and performance of distributed systems.
- Stephen O’Sullivan, Senior Director at Walmart Labs, will draw upon his 20+ years of experience creating enterprise applications and data management solutions, and his leadership experience at companies at the bleeding edge of technology like Walmart, LiveOps, Yahoo!, and Sun.
If this sounds like the kind of session you will enjoy, we would love to see you at APP-CAP 1963 (Big Data and Big Analytics Panel).