
Tag Archives: hadoop

Breaking the Mindset: Why Hadoop Can and Should Move Past Bare-Metal Deployments to Virtualization

Whenever we’ve dealt with something for a while, our way of thinking about it becomes a habit. Hadoop deals with a lot of data. Currently, the record is 100 petabytes in a Facebook cluster that analyzes log data. Because Hadoop was built for the data volumes and performance demands of companies like Google and Facebook, it was originally designed to run on bare-metal servers. Since virtualization wasn’t an option from the get-go, the notion that you can’t safely run that much data on a movable virtual machine has largely gone unchallenged.

However, as time has gone on and technology has allowed for persistent storage in the cloud, organizations have started to rethink this paradigm. In fact, several companies are using Hadoop and big data today to gain competitive advantage. And while they run it on virtualized infrastructure, they are not moving the data around; the advantages lie elsewhere.

VMware’s Big Data product line marketing manager, Joe Russell, spoke with Roberto Zicari this week in an interview on ODBMS.org that helps articulate not only why Hadoop can run on virtual infrastructure using Project Serengeti, but also why companies should consider it to save time and make Hadoop more usable. Continue reading

7 Myths on Big Data—Avoiding Bad Hadoop and Cloud Analytics Decisions

Hadoop is an open source legend built by software heroes.

Yet, legends can sometimes be surrounded by myths, and these myths can lead IT executives down a path wearing rose-colored glasses.

Data and data usage are growing at an alarming rate. Just look at the numbers from analysts: IDC predicts a 53.4% growth rate for storage this year, AT&T claims 20,000% growth of their wireless data traffic over the past 5 years, and if you take a look at your own communications channels, it’s guaranteed that the internet content, emails, app notifications, social messages, and automated reports you get every day have dramatically increased. This is why companies ranging from McKinsey to Facebook to Walmart are doing something about big data.

Just like we saw in the dot-com boom of the 90s and the web 2.0 boom of the 2000s, the big data trend will also lead companies to make some really bad assumptions and decisions.

Hadoop is certainly one major area of investment for companies looking to solve big data needs. Companies like Facebook that have famously dealt well with large data volumes have publicly touted their successes with Hadoop, so it’s natural that companies approaching big data first look to the successes of others. A really smart MIT computer science grad once told me, “when all you have is a hammer, everything looks like a nail.” This functional fixedness is the cognitive bias to avoid amid the hype surrounding Hadoop. Hadoop is a multi-dimensional solution that can be deployed and used in different ways. Let’s look at some of the most common preconceived notions about Hadoop and big data that companies should know before committing to a Hadoop project: Continue reading

10 Ways to Make Hadoop Green in the CFO’s Eyes

Hadoop is used by some pretty amazing companies to make use of big, fast data—particularly unstructured data. Huge brands on the web like AOL, eBay, Facebook, Google, Last.fm, LinkedIn, MercadoLibre, Ning, Quantcast, Spotify, Stumbleupon, Twitter, as well as some more brick and mortar giants like GE, Walmart, Morgan Stanley, Sears, and Ford use Hadoop.

Why? In a nutshell, companies like McKinsey believe the use of big data and technologies like Hadoop will allow companies to better compete and grow in the future.

Hadoop is used to support a variety of valuable business capabilities—analysis, search, machine learning, data aggregation, content generation, reporting, integration, and more. All types of industries use Hadoop—media and advertising, A/V processing, credit and fraud, security, geographic exploration, online travel, financial analysis, mobile phones, sensor networks, e-commerce, retail, energy discovery, video games, social media, and more. Continue reading

Join Us at Strata – Feb 26-28 in Santa Clara

The vFabric and Greenplum teams will be at Strata on Feb 26-28 at the Santa Clara Convention Center.

While the Pivotal Initiative is forming, both vFabric and Greenplum groups will be represented separately. Of course, you can also learn what’s going on by checking out Strata Greenplum or Strata VMware on Twitter.

If you aren’t familiar with Strata, it is a great conference for those building apps in the cloud. Its focus is the future of big data and how to use big data successfully. Speakers include representatives from Google, VMware, Amazon, Microsoft, and many other software companies focused on the big data space. Topics include: Continue reading

Announcing the Availability of vFabric Data Director 2.5, GemFire 7, EM4J 1.2, and More

Application developers and data management teams continue to look for ways to modernize legacy apps, manage costs more effectively, build new apps on robust application platforms, and solve big data problems. These are some of the key reasons why vFabric is on the CIO (or CTO) agenda. With several new product releases in the vFabric Suite, VMware continues to provide a best-in-class application platform and help customers solve their top application development and data management problems.

vFabric Data Director 2.5

Database as a Service (DBaaS) helps companies virtualize data engines and automate management while getting a handle on the costs and compliance issues related to data sprawl. In the newest version of Data Director, several new data engines are supported (in addition to Oracle and Postgres) along with other new capabilities:

  • Support for Microsoft SQL Server 2008 R2 and SQL Server 2012
  • Support for Hadoop deployment, management, and monitoring across all major distributions through Project Serengeti
  • Enhanced automation of Oracle and SQL Server template creation
  • Broad support for Red Hat Enterprise Linux (RHEL) and Oracle Linux
  • Enhanced Oracle database ingestion, including ingestion to a point-in-time and more
  • Support for static IP database virtual machines (DBVMs)
  • Express set-up for development or experimentation

Continue reading

3 Steps on Using Spring Insight Developer to Analyze Code

If you don’t know about Spring Insight Developer, this post may save you tons of time and potentially some headaches.

Imagine that you need to update some code behind a button, but you didn’t write the code. What if you could press that button and then see what code was invoked (including methods and arguments), the SQL that was executed, and the time it took to run?

This is what Spring Insight Developer allows you to do, and more.

It’s also free. Because it uses AspectJ and AOP to load-time weave your application, you do not have to make any changes to your application code to use it.
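
As a hedged illustration of the kind of code Insight traces, here is a minimal sketch of a Spring MVC handler backed by a SQL query; the controller, table, and URL below are hypothetical examples, not code from the article.

    import java.util.Map;

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.stereotype.Controller;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.ResponseBody;

    // Hypothetical controller: Insight's load-time weaving instruments the deployed
    // application, so no tracing code appears in the class itself.
    @Controller
    public class OrderController {

        private final JdbcTemplate jdbcTemplate;

        @Autowired
        public OrderController(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        // Pressing the button fires this request. The Insight console would then show
        // the handler method, its 'id' argument, the SELECT below, and the elapsed time.
        @RequestMapping("/orders/{id}")
        @ResponseBody
        public Map<String, Object> findOrder(@PathVariable("id") long id) {
            return jdbcTemplate.queryForMap(
                "SELECT id, status, total FROM orders WHERE id = ?", id);
        }
    }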

Let’s take a look at a simple example of tracing your app, viewing the details, and seeing the code in action.

Continue reading

3 Signs Your Relational Database Must Go

Application and operations teams sometimes reach a point where they must upgrade the database. Whether it’s due to data growth, lack of throughput, too much downtime, the need to share data globally, adding ETLs, or otherwise, it’s never a small project. Since these projects are expensive, any recommendation requires a solid justification.  This article a) characterizes 3 signs where traditional databases hit a wall, b) explains how vFabric SQLFire provides an advantage over traditional databases in each case, and c) should help you make a case for moving towards an in-memory, distributed data grid based on SQL.
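
To make “an in-memory, distributed data grid based on SQL” a little more concrete, here is a minimal sketch of talking to SQLFire through plain JDBC; the connection URL, port, table, and the partitioning and redundancy clauses shown are assumptions for illustration, not details from this article.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SqlFireSketch {
        public static void main(String[] args) throws Exception {
            // Assumed server address and port; adjust to your own deployment.
            try (Connection conn = DriverManager.getConnection("jdbc:sqlfire://localhost:1527/");
                 Statement stmt = conn.createStatement()) {

                // Standard SQL DDL, plus (assumed) clauses that spread the table across
                // the in-memory grid and keep a redundant copy for availability.
                stmt.execute("CREATE TABLE orders (id INT PRIMARY KEY, total DECIMAL(10,2)) "
                           + "PARTITION BY COLUMN (id) REDUNDANCY 1");

                stmt.execute("INSERT INTO orders VALUES (1, 99.50)");

                try (ResultSet rs = stmt.executeQuery("SELECT id, total FROM orders")) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + " -> " + rs.getBigDecimal("total"));
                    }
                }
            }
        }
    }

The point of the sketch: the application keeps speaking ordinary SQL over JDBC, while the data itself is partitioned and replicated in memory across the members of the grid.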

For those of us tasked with upgrading (or architecting) the data layer, we all go through similar steps. We build a project plan, make projections and sizing estimates, perform architecture and code reviews, create configuration checklists, provide hardware budgets and plans, talk to vendors about options, and more. Then, we work to plan the deployment with the least downtime, procure hardware and software, test different data load times, evaluate project risks, develop back-up plans, prepare communications to users about downtime, etc. You know the drill. These projects can take months and consume a fair amount of internal resources or consulting dollars. If you are starting or working on one of these types of projects with a traditional database architecture in mind, are you taking these 3 signs into account as you weigh your options? Continue reading

Webcast: Big, Fast, Flexible Data with Cloud Delivery

As we’ve previously covered, data growth is quite unbelievable, and this means traditional database models are being stretched. On Tuesday, November 13, 2012 at 9:00 AM PST, VMware’s Joe Russell will be presenting on several topics related to Big, Fast, Flexible Data and how VMware’s key data management technologies help companies overcome some of the key challenges of traditional RDBMSs.

Attend to learn:

  • How Hadoop and new analytics technologies are allowing companies to use Big Data in new ways to gain meaningful business insights
  • What’s new with Project Serengeti, a VMware initiative to help you deploy and manage elastic Hadoop clusters in minutes
  • How Fast Data is bringing data logic in-memory, allowing for dramatic scale, reduced costs, and improved performance
  • How Flexible Data, including NoSQL and open source relational data technologies, can improve your data model
  • How virtualizing the database layer enables a new Cloud Delivery Model, allowing enterprise IT departments to offer self-service data services elastically on demand, maintain centralized control, and operate within regulatory guidelines

>> Register for the webinar Big, Fast, Flexible Data with Cloud Delivery on Tuesday, November 13, 2012 at 9:00 AM PST.

vFabric Data Director supports Oracle, SQL Server, Hadoop, Postgres – Accelerates database virtualization and Big Data adoption

Virtualization continues to be one of the top priorities for CIOs. As the share of virtualized workloads approaches 60%, enterprises are looking at database and big data workloads as the next target. Their goal is to realize the benefits of virtualization across the plethora of relational databases sprawling in their data centers. With the increasing popularity of analytic workloads on Hadoop, virtualization presents a fast and efficient way to get started with existing infrastructure and to scale the data dynamically as needed.

VMware’s vFabric Data Director 2.5 now extends the benefits of virtualization both to traditional relational databases like Oracle, SQL Server, and Postgres and to multi-node Big Data solutions like Hadoop. SQL Server and Oracle represent the majority of databases in enterprises, and Hadoop is one of the fastest-growing data technologies in the enterprise.

vFabric Data Director enables the most common databases found in the enterprise to be delivered as a service with the agility of public cloud and enterprise-grade security and control.

The key new features in vFabric Data Director 2.5 are:

  • Support for SQL Server – Currently supported versions of SQL Server are 2008 R2 and 2012.
  • Support for Apache Hadoop 1.0-based distributions: Apache Hadoop 1.0, Cloudera CDH3, Greenplum HD 1.1, 1.2 and Hortonworks HDP-1. Data Director leverages VMware’s open source Project Serengeti to deliver this capability.
  • Streamlined Data Director Setup – Complete setup in less than an hour
  • One-click template creation for Oracle and SQL Server through ISO-based database and OS installation
  • Oracle database ingestion enhancements – Now includes Point In Time Refresh (PITR)

Data Director’s self-provisioning enables a whole new level of operational efficiencies that greatly accelerates application development. With this new release, Data Director now delivers these efficiencies in a heterogeneous database environment.

Continue reading

VMware’s Serengeti – Virtualized Hadoop at Cloud-scale

Not long ago I covered the topic of Big Data adoption in the enterprise. In it, I described how Serengeti enables enterprises to respond to common Hadoop implementation challenges resulting from the lack of usable enterprise-grade tools and the shortage of infrastructure deployment skills.

With the latest release of the open source Project Serengeti, VMware continues on its mission to deliver the easiest and most reliable virtualized Big Data platform. One of the most distinctive attributes of a Serengeti Hadoop deployment is that it can easily coexist with other workloads on existing infrastructure.

Serengeti-deployed Hadoop clusters can also be configured with either a local or a shared, scale-out data storage architecture. This storage layer can even be shared across multiple HDFS-based analytical workloads. And, in the future, this could potentially be extended to other, non-HDFS-based data engines.
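
As a rough sketch of what that choice can look like in practice, here is a hypothetical JSON cluster specification of the general shape Serengeti consumes; the node group names, role names, and storage fields are illustrative assumptions and may not match the exact schema.

    {
      "nodeGroups": [
        {
          "name": "master",
          "roles": ["hadoop_namenode", "hadoop_jobtracker"],
          "instanceNum": 1,
          "cpuNum": 2,
          "memCapacityMB": 4096,
          "storage": { "type": "SHARED", "sizeGB": 50 }
        },
        {
          "name": "worker",
          "roles": ["hadoop_datanode", "hadoop_tasktracker"],
          "instanceNum": 4,
          "cpuNum": 2,
          "memCapacityMB": 2048,
          "storage": { "type": "LOCAL", "sizeGB": 100 }
        }
      ]
    }

In this sketch the master node group sits on shared storage while the worker (data) nodes use local disks, which is the kind of trade-off the local-versus-shared choice is about.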

The elasticity of the underlying vSphere virtualization platform helps Serengeti achieve new levels of efficiency. This architecture enables organizations to share existing infrastructure with Big Data analytical workloads to deliver optimal storage capacity and performance. Continue reading