Tuesday, October 6, 2009

Hadoop World 2009

Hadoop is a Java framework for implementing the Map/Reduce programming model described in the 2004 paper by Jeffrey Dean and Sanjay Ghemawat of Google. Map/Reduce lets you parallelize a large task by breaking it into a map function, which transforms input key/value pairs into intermediate key/value pairs, and a reduce function, which combines all of the intermediate values that share a key. Using a cluster of nodes - either your own or those in a computing cloud like Amazon’s EC2 - you can do large-scale parallel processing.
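
As a concrete illustration (not something presented at the conference), here is a minimal sketch of the canonical word-count job written against Hadoop’s 0.20-era Java MapReduce API. The mapper emits a (word, 1) pair for every token in its input, the reducer sums the counts for each word, and the input and output paths are just command-line arguments for the example.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for each token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}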


I attended the Hadoop World conference in NYC on October 1, 2009 and came back very excited about this technology. We have had an interest in exploring Map/Reduce to support our work analyzing blog text, but until now have not had the resources to devote to it. After getting a feel for what this technology makes possible, we are ready to dive in. A copy of Tom White’s “Hadoop: The Definitive Guide” was provided, so I am on my way.


The message I took away from this conference is that with the vast amount of data available today, from genomic and other biological data to data in online social networks, we can use new computational tools to help us better understand people at both the micro and macro levels. To do this, however, we need processing capabilities that scale to the petabyte level, and technologies like Hadoop, coupled with cloud computing, are one way to approach the problem. The notion that “more data beats better algorithms” may or may not hold in all cases, but “big data” is here and we now have usable and relatively cheap ways to process it.


Hadoop grew out of the Lucene/Nutch community and became an official Apache project in 2006. Yahoo! soon afterward adopted it for their Web crawls. Since then a number of subprojects have started up under Hadoop, allowing for querying, analysis and data management.


Cloudera (“Cloud-era”, get it?) was the major organizer of Hadoop World, and Christophe Bisciglia, one of the principals of Cloudera, started off with a review of the history of Hadoop and a description of a number of the subprojects it has spawned. Cloudera’s business is based on providing Linux packages for deploying Hadoop on private servers and on maintaining Amazon EC2 instances for use in the cloud. Cloudera has also introduced a browser-based GUI for managing Hadoop clusters: the Cloudera Desktop.


Peter Sirota, the manager of Amazon’s Elastic MapReduce, was up next and discussed Pig, an Apache/Hadoop subproject that provides a high-level data analysis layer for Map/Reduce, and Karmasphere, a NetBeans-based GUI for managing EC2 instances and Map/Reduce jobs.
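
To give a feel for what that higher-level layer buys you, here is a sketch of the same word count expressed in a few lines of Pig Latin and submitted through Pig’s PigServer Java API; the input path, output path, and relation names are assumptions made up for the example.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // Run the script as Map/Reduce jobs on the cluster (ExecType.LOCAL also works for testing).
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Each statement is a line of Pig Latin; the final relation is stored back to HDFS.
    pig.registerQuery("lines = LOAD 'input/docs.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "output/wordcounts");
  }
}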


Eric Baldeschwieler of Yahoo! described their use of Hadoop. Yahoo!, he said, is the “engine room” of Hadoop development and is the largest contributor to the open source project. They use Hadoop to process the data for Yahoo! Web search, using a 10,000-core Linux cluster with 5 petabytes of raw disk space. They also used Hadoop to win the Jim Gray Sort Benchmark competition, sorting one terabyte in 62 seconds and one petabyte in 16 hours.


Other presenters included Ashish Thusoo of Facebook. The amount of new data added to Facebook on a daily basis is astounding. In March of 2008 it was a modest 200 gigabytes a day. In April of 2009 it was over two terabytes a day, and in October of ’09 it is over four terabytes per day. They have made Hadoop an integral part of their processing pipeline in order to deal with this rate of growth. One new Apache/Hadoop subproject that they have been using is Hive. Hive is a data warehouse infrastructure that allows for data analysis and querying of data in Hadoop files.
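
As a rough sketch of what that looks like in practice, here is a small program that queries a Hive table over JDBC; the driver class name and connection URL follow the conventions of early Hive releases, and the word_counts table is purely an assumption for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // Hive's JDBC driver (class name used by the early Hive releases).
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

    // Assumes a Hive server listening on localhost:10000.
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // A SQL-like query that Hive compiles into Map/Reduce jobs over files in HDFS.
    ResultSet rs = stmt.executeQuery(
        "SELECT word, COUNT(1) FROM word_counts GROUP BY word");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getString(2));
    }
    con.close();
  }
}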


There were a number of talks in the afternoon portion of the conference. Two that interested me in particular were Jake Hofman’s talk on Yahoo! Research’s use of Hadoop for social network analysis, and Charles Ward’s talk on a joint effort by Stony Brook University and General Sentiment (a Stony Brook commercial spinoff) to analyze entity references and sentiment in blogs and online media. Another, from Paul Brown of Booz Allen, discussed how they used Hadoop to calculate protein alignments. He also demonstrated a visualization of a Hadoop cluster in action that needs to be seen – I’ll see if I can find a link...


There is a rich set of tweets from the conference on Twitter. These give a detailed, minute-by-minute picture of what went on there. The one oddity about the conference – the yellow elephant in the room, if you will – was the absence of Google. There’s an interesting angle to that, I’m sure, but I have no idea what it might be.


This was a great conference and it looks like we are just at the start of a revolution in the way we deal with large volumes of data.


 