Hadoop Professionals

A Community for Hadoop Users

This network is a place to discuss and learn Hadoop, Solr, Katta, and MapReduce, and to discuss Hadoop resources such as Hadoop books.

Members

  • Jason Venner
  • Zhiqiang Ma
  • Uday Kurkure
  • Saranyu Netmanee
  • Alexey Tigarev
  • wang zhengkui
  • stack
  • Stefan Groschupf
  • Shevek
  • Sridhar
  • Ram
  • florian Leibert
  • Tom White
  • G Sondeep
  • Aaron Kimball
  • Jon Baer

Latest Activity

Jason Venner added an event
Bay Area Hadoop User Group (HUG) February Meetup at Yahoo! Sunnyvale Campus Building E - Classrooms 9 and 10
February 17, 2010 from 6pm to 9pm
Hello Hadoopers, RSVP is open for the February Bay Area Hadoop User Group at Yahoo!'s Sunnyvale campus. Agenda: 6:00 - 6:15 - Socializing and Beers 6:15 - 7:00 - LZO Compression and Protocol Buffers: Efficient, Flexible Data Processing with Had…
8 hours ago
I figured out the problem already. Thanks
21 hours ago
Unless you explicitly set it, you will get TextInputFormat as your input format, and the keys are LongWritable. If you want a Text key, call job.setInputFormat(KeyValueTextInputFormat.class) in your main, or change the key that your mapper t…
23 hours ago
Zhiqiang Ma added a discussion
Hi All, I am a beginner of Hadoop. I modified the Inverted Index Code in Yahoo's Tutorial  (http://developer.yahoo.com/hadoop/tutorial/module4.html#solution), but I always get errors of "java.io.IOException: Type mismatch in key from map: expected…
on Thursday
Zhiqiang Ma is now a member of Hadoop Professionals
on Thursday
Saranyu Netmanee added a discussion
I wrote a program that counts websites from a log file, and I pass a parameter when I run it ( bin/hadoop jar webcount.jar WebCount webtudom.log output tudom ). tudom is the third parameter; in wordcount, running the program passes two parameters, in and out…
on Thursday
Alexey Tigarev I develop Hadoop/MapReduce applications. I only work remotely (telecommute). The hourly rate is $29. Message me with your proposition.
on Wednesday
A group for HBase users to share use cases, solutions and problems.
on Wednesday
1) Hadoop is traditionally the Hadoop file system, HDFS, and the Hadoop MapReduce service, combined. 2) HBase can be used as a data store; like all data stores, it has access patterns that work well and access patterns that do not. 3) read the google…
on Wednesday
nijil added a discussion
I have read some basic material about Hadoop, but I have a few doubts. Mind you, I am a beginner. 1: So is Hadoop only a file system? 2: Can HBase be used instead of other databases on other platforms (e.g. Java)? 3: What is MapReduce exactly and how is it relat…
February 1
nijil is now a member of Hadoop Professionals
February 1
RJ updated their profile
January 28
RJ is now a member of Hadoop Professionals
January 28
January 28
For all intents and purposes your reduce doesn't start until the reduce % hits 60%; the parts that run prior to that are involved in preparing the data for your reduce tasks. The job output presents this information in a confusing way.
January 27
angela, Adeel, Shevek and 1 more joined Hadoop Professionals
January 27


Help With Hadoop

A great place to learn Hadoop, and to tune your map reduce jobs.

Ask specific Hadoop questions here to get help from an expert :)

Forum

Zhiqiang Ma

Type mismatch error 2 Replies

Started by Zhiqiang Ma. Last reply by Zhiqiang Ma 21 hours ago.

Saranyu Netmanee

How to write and run a program on a cluster?

Started by Saranyu Netmanee Feb 4.

nijil

help on hadoop for a beginner 1 Reply

Started by nijil. Last reply by Jason Venner Feb 3.

Blog Posts

Mark Cejas

.bashrc file error

Hello all,

I hope that the holidays are going well,
I finally have my graduate school work behind me and have more time to learn about this wonderful Hadoop tool. I work on a Fedora 11 distribution and, after setting my JAVA_HOME and HADOOP_HOME paths, I started to encounter the following error. The error is observed upon switching to the root user as follows:

[rasaan@rasaan ~]$ su
Password:
bash: /root/.bashrc: line 9: unexpected EOF while looking for matching `)'
bash: /root/.bashrc: line 14…

Posted by Mark Cejas on December 31, 2009 at 12:23pm — 1 Comment

Jason Venner

I am giving a talk at the HUG on Wed, scaling search with hadoop, katta and solr

Jason Rutherglen will be providing the in depth lucene/solr pieces.

Hope to see you there.

Posted by Jason Venner on November 17, 2009 at 12:57pm

dekel tankel

Hadoop Bay Area User Group - November 18th at Yahoo!

Hi Hadoopers

You are welcome to join us for the next Bay Area Hadoop User Group at the Yahoo! Sunnyvale Campus - Wed, Nov 18th at 6PM.

We have some interesting talks planned:

* Katta, Solr, Lucene and Hadoop - Searching at scale, Jason Rutherglen and Jason Venner

* Walking through the New File system API, Sanjay Radia, Yahoo!

* Keep your data in Jute but still use it in python, Paul Tarjan, Yahoo!


Please RSVP here:
http://www.meetup.com/hadoop/calendar/11724002/


see you there
Dekel

Posted by dekel tankel on November 9, 2009 at 9:29am

Jason Venner

Thanks to Stephane for a fun Katta Meetup last night.

There were good discussions on Katta, Solr, machine learning, and general machine performance.

Posted by Jason Venner on September 30, 2009 at 7:29am

Jason Venner

Cloudera folds HBase into their 0.20 Hadoop distribution

Per Michael Stack,

Our Andrew Purtell, working with Chad Metcalf over at Cloudera, has added HBase to the CDH2 Cloudera distribution. Andrew has a guest blog over on Cloudera here: http://su.pr/27zIMw St.Ack

Enjoy!

Posted by Jason Venner on September 29, 2009 at 8:22am

Yahoo Hadoop Developer Blog

Hadoop Bay Area User Group - Feb 17th at Yahoo!, Sunnyvale

Hi Hadoopers,

Yahoo! is hosting the monthly Bay Area Hadoop User Group on Wednesday, February 17th 6PM at the Yahoo! Sunnyvale Campus. Whether you are an active submitter, developing using Hadoop related technologies or completely new to Hadoop -- we'd love to see you.

We are hosting Kevin Weil, who leads the analytics team at Twitter. Kevin will provide an overview of Hadoop and Pig at Twitter. He will cover LZO Compression and Protocol Buffers and how they are combined with Pig for flexible data storage and fast Map-Reduce jobs.

We are also hosting Mathias Stearn from 10Gen, who will provide an introduction to MongoDB and Hadoop. Mathias will explain how to get started with CRUD and JavaScript, how to create schemas, and how to scale. He will also describe the integration between MongoDB and Hadoop.

Registration and additional session information are available on the Bay Area HUG Meetup page.

We look forward to seeing you there. For those of you who can't attend in person, stay tuned! We will publish the slides and video recording after the event.

Dekel Tankel
Director, Product Management
Cloud Computing at Yahoo!


Comparing Pig Latin and SQL for Constructing Data Processing Pipelines

I have been asked by users who are going to construct a data pipeline whether they should use Pig Latin or SQL.

For those of you who are not familiar with Pig, it is a platform for analyzing large data sets. It is built on Hadoop and provides ease of programming, optimization opportunities and extensibility. Pig Latin is the relational data-flow language and is one of the core aspects of Pig.

In this blog I use "data pipeline" to mean an application that takes data from one or more sources, cleanses it, does some initial transformation on it that all the data readers will need, and then stores it in a data warehouse. As SQL is known by almost everyone, it is often chosen as the language in which to write these data pipelines.

We are comparing Pig Latin over Hadoop to SQL over a relational database.

SQL's ubiquity is convenient. However, I believe that Pig Latin is a more natural choice for constructing data pipelines, for several reasons:

  1. Pig Latin is procedural, where SQL is declarative.
  2. Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline.
  3. Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer.
  4. Pig Latin supports splits in the pipeline.
  5. Pig Latin allows developers to insert their own code almost anywhere in the data pipeline.

I will consider each of these points in turn.

Pig Latin is Procedural

Since Pig Latin is procedural, it fits very naturally in the pipeline paradigm. SQL on the other hand is declarative. Consider, for example, a simple pipeline, where data from sources users and clicks is to be joined and filtered, and then joined to data from a third source geoinfo and aggregated and finally stored into a table ValuableClicksPerDMA. In SQL this could be written as:

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
                select name, ipaddr
                from users join clicks on (users.name = clicks.user)
                where value > 0
            ) as userclicks using (ipaddr)
group by dma;

The Pig Latin for this will look like:

Users                = load 'users' as (name, age, ipaddr);
Clicks               = load 'clicks' as (user, url, value);
ValuableClicks       = filter Clicks by value > 0;
UserClicks           = join Users by name, ValuableClicks by user;
Geoinfo              = load 'geoinfo' as (ipaddr, dma);
UserGeo              = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA                = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Notice how SQL forces the pipeline to be written inside-out, with operations that need to happen first happening in the from clause sub-query. Of course this can be resolved with the use of intermediate or temporary tables. Then the pipeline becomes a disjointed set of SQL queries where ordering is only apparent by looking at a master script (written in some other language) that sews all the SQL together. Also, depending on how the database handles temporary tables, there may be cleanup issues to deal with. In contrast, Pig Latin shows users exactly the data flow, without forcing them to either think inside out or construct a set of temporary tables and manage how those tables are used between different SQL queries.

The pipeline given above is obviously simple and contrived. It consists of only two very simple steps. In practice, data pipelines at large organizations are often quite complex. If each Pig Latin script spans ten steps, then the number of scripts to manage in source control, code maintenance, and the workflow specification drops by an order of magnitude.
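
To make the step-by-step data flow concrete, here is a minimal single-process sketch in plain Java that mirrors each Pig Latin statement in order. It is only an analogy, not how Pig executes on Hadoop. The class name, the sample rows, and the DMA labels are hypothetical, and the hash-lookup joins assume that user names and IP addresses are unique.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ValuableClicksSketch {
    // Rows are String arrays, mirroring the untyped tuples in the script:
    // users {name, age, ipaddr}, clicks {user, url, value}, geoinfo {ipaddr, dma}.
    static Map<String, Long> valuableClicksPerDma(
            List<String[]> users, List<String[]> clicks, List<String[]> geoinfo) {
        // ValuableClicks = filter Clicks by value > 0;
        List<String[]> valuableClicks = new ArrayList<>();
        for (String[] click : clicks) {
            if (Integer.parseInt(click[2]) > 0) {
                valuableClicks.add(click);
            }
        }

        // UserClicks = join Users by name, ValuableClicks by user;
        // Assumption: names are unique, so a hash lookup stands in for the join.
        Map<String, String> ipByUser = new HashMap<>();
        for (String[] user : users) {
            ipByUser.put(user[0], user[2]);
        }
        List<String> userClickIps = new ArrayList<>();
        for (String[] click : valuableClicks) {
            String ip = ipByUser.get(click[0]);
            if (ip != null) {
                userClickIps.add(ip);
            }
        }

        // UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
        Map<String, String> dmaByIp = new HashMap<>();
        for (String[] geo : geoinfo) {
            dmaByIp.put(geo[0], geo[1]);
        }

        // ByDMA = group UserGeo by dma;
        // ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
        Map<String, Long> counts = new TreeMap<>();
        for (String ip : userClickIps) {
            String dma = dmaByIp.get(ip);
            if (dma != null) {
                counts.merge(dma, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Hypothetical sample data, for illustration only.
    static Map<String, Long> demo() {
        List<String[]> users = Arrays.asList(
            new String[] {"alice", "30", "10.0.0.1"},
            new String[] {"bob", "25", "10.0.0.2"},
            new String[] {"carol", "41", "10.0.0.3"});
        List<String[]> clicks = Arrays.asList(
            new String[] {"alice", "/a", "5"},
            new String[] {"alice", "/b", "0"},
            new String[] {"bob", "/c", "2"},
            new String[] {"carol", "/d", "1"},
            new String[] {"dave", "/e", "3"});
        List<String[]> geoinfo = Arrays.asList(
            new String[] {"10.0.0.1", "SF"},
            new String[] {"10.0.0.2", "SF"},
            new String[] {"10.0.0.3", "NYC"});
        return valuableClicksPerDma(users, clicks, geoinfo);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {NYC=1, SF=2}
    }
}
```

Each block of the method corresponds to one named relation in the script, in the same order, which is the property the procedural style is meant to give.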

Checkpointing Data

Experienced data pipeline developers will object to the point above about Pig Latin not needing temporary tables. They will note that storing data in between operations has the advantage of checkpointing data in the pipeline. That way, when a failure occurs, the whole pipeline does not have to be rerun. This is true. Pig Latin allows users to store data at any point in the pipeline without disrupting the pipeline execution. The advantage that Pig Latin provides is that pipeline developers decide where appropriate checkpoints are in their pipeline rather than being forced to checkpoint wherever the semantics of SQL impose it. So, if for the above pipeline there was a need to store data after the second join (UserGeo) and before the group by (ByDMA), the script could be changed to:

Users                = load 'users' as (name, age, ipaddr);
Clicks               = load 'clicks' as (user, url, value);
ValuableClicks       = filter Clicks by value > 0;
UserClicks           = join Users by name, ValuableClicks by user;
Geoinfo              = load 'geoinfo' as (ipaddr, dma);
UserGeo              = join UserClicks by ipaddr, Geoinfo by ipaddr;
store UserGeo into 'UserGeoIntermediate';
ByDMA                = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

This would result in no additional Map Reduce jobs. Pig would store the intermediate data after the join and continue with the aggregation as before.

Faith in the Optimizer

By definition, a declarative language allows the developer to specify what must be done, not how it is done. Thus in SQL users can specify that data from two tables must be joined, but not what join implementation to use. Developers are forced to have faith that the optimizer will make the right choice for them. Some databases work around this by allowing hints to be given to the optimizer, but even then the implementation is not required to follow those hints.

While for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm, this is not usually the case for data pipelines. Data flowing through data pipelines does not tend to vary significantly from run to run, in terms either of volume or key distribution. In addition data pipeline developers are usually sophisticated enough to choose the correct algorithm. For these reasons allowing developers to explicitly choose an implementation, and be guaranteed that their choice will be honored, is quite useful in data pipelines.

Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways. For joins and grouping operations users can specify an implementation to use, and Pig guarantees that it will use that implementation. Currently Pig supports four different join implementations and two grouping implementations. It also allows users to specify parallelism of operations inside a Pig Latin script, and does not require that every operator in the script have the same parallelization factor. This is important because data sizes often grow and shrink as data flows through the pipeline.

Splits in Pipelines

Another common feature of data pipelines is that they are often directed acyclic graphs (DAGs) rather than linear pipelines. SQL, however, is oriented around queries that produce a single result. Thus SQL handles trees (such as joins) naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A very common use case we have seen at Yahoo! is a desire to read one data set in a pipeline, group it by multiple different grouping keys, and store each result as separate output. Since disk reads and writes (both scan time and intermediate results) usually dominate processing of large data sets, reducing the number of times data must be written to and read from disk is crucial to good performance.

Take for example a user data set where there is a desire to analyze the data set both in geographic and demographic dimensions. The Pig Latin to do this analysis looks like:

Users         = load 'users' as (name, age, gender, zip);
Purchases     = load 'purchases' as (user, purchase_price);
UserPurchases = join Users by name, Purchases by user;
GeoGroup      = group UserPurchases by zip;
GeoPurchase   = foreach GeoGroup generate group, SUM(UserPurchases.purchase_price) as sum;
ValuableGeos  = filter GeoPurchase by sum > 1000000;
store ValuableGeos into 'byzip';
DemoGroup     = group UserPurchases by (age, gender);
DemoPurchases = foreach DemoGroup generate group, SUM(UserPurchases.purchase_price) as sum;
ValuableDemos = filter DemoPurchases by sum > 100000000;
store ValuableDemos into 'byagegender';

This Pig Latin script describes a DAG rather than a pipeline. It starts with two inputs which are brought into one stream (via join) which is then split into two streams. Pig will do this in two Map Reduce jobs (one for the join and one for both group bys and their filters) rather than requiring that the join be either run twice or materialized as an intermediate result as traditional SQL would.
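
The benefit of the split can be sketched in plain Java as a single scan over the joined data that feeds both groupings at once. This is a single-process analogy under stated assumptions, not a description of how Pig plans its Map Reduce jobs. The class, field, and parameter names are hypothetical, the sample rows are invented, and the thresholds are passed in rather than fixed at the values used in the script.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SplitPipelineSketch {
    // One row of the joined UserPurchases relation (only the fields used here).
    static final class UserPurchase {
        final int age;
        final String gender;
        final String zip;
        final long purchasePrice;

        UserPurchase(int age, String gender, String zip, long purchasePrice) {
            this.age = age;
            this.gender = gender;
            this.zip = zip;
            this.purchasePrice = purchasePrice;
        }
    }

    // The two outputs of the split: one per grouping key.
    static final class Result {
        final Map<String, Long> byZip = new TreeMap<>();
        final Map<String, Long> byAgeGender = new TreeMap<>();
    }

    static Result analyze(List<UserPurchase> joined, long geoMin, long demoMin) {
        Result result = new Result();
        // One pass over the joined data updates both aggregations,
        // so the join output is read once rather than once per grouping.
        for (UserPurchase row : joined) {
            result.byZip.merge(row.zip, row.purchasePrice, Long::sum);
            result.byAgeGender.merge(row.age + "," + row.gender, row.purchasePrice, Long::sum);
        }
        // filter ... by sum > threshold, applied to each branch separately.
        result.byZip.values().removeIf(sum -> sum <= geoMin);
        result.byAgeGender.values().removeIf(sum -> sum <= demoMin);
        return result;
    }

    // Hypothetical sample data, for illustration only.
    static Result demo() {
        List<UserPurchase> joined = Arrays.asList(
            new UserPurchase(30, "F", "94089", 600),
            new UserPurchase(30, "F", "94089", 500),
            new UserPurchase(41, "M", "10001", 300),
            new UserPurchase(30, "F", "10001", 200));
        return analyze(joined, 1000, 1000);
    }

    public static void main(String[] args) {
        Result result = demo();
        System.out.println(result.byZip);       // prints {94089=1100}
        System.out.println(result.byAgeGender); // prints {30,F=1300}
    }
}
```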

Inserting Developer Code

Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. This is accomplished through user defined functions (UDFs) and streaming. UDFs allow users to specify how data is loaded, how it is stored, and how it is processed. Streaming allows users to include executables at any point in the data flow.

Allowing developers to specify how data is loaded is useful because in most data pipelines data sources are not database tables. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin. There are many ETL tools on the market to handle this import process for databases. Pig allows developers to write a function in Java to read data directly from the source. This eliminates the need for a second tool which must be purchased, learned, and used and allows the data pipeline to combine the loading and initial cleansing and transformation steps.

Pipelines also often include user defined column transformation functions and user defined aggregations. Pig Latin supports writing both of these types of functions in Java. We plan to extend that to a number of scripting languages in the near future, thus enabling users to easily write UDFs in the language of their choice.

If the user defined code will not fit well into a UDF, streaming allows pipelines to place an executable in the pipeline at any point. This can also be used to include legacy functionality that cannot be modified.

To conclude, I hope you will agree with me that these advantages (an intuitive, procedural programming model, control of where data is checkpointed in the pipeline, the ability to completely control how data is processed, support for general DAGs, and the ability to include user code wherever necessary) make Pig Latin a better choice for developing data pipelines on Hadoop.

Alan Gates, Architect
Pig Development Team, Yahoo!

Video from Jan. 20, 2010 Hadoop Bay Area User Group now online

The videos from the last Hadoop Bay Area User Group are now available on the recap post. Slides from the presentations are also on that post.

 


Stomping out Java "concurrency cockroaches" with SureLogic's Flashlight and JSure tools

Concurrency errors are the cockroaches of the software bug ecosystem. They are difficult to detect and tend to appear at the most inopportune times. And, when you try to shine a light on them, they go into hiding. When supplying widely-used infrastructure software, like Hadoop, it is doubly important to stomp them out and keep them from coming back. To date, there has been a shortage of tools focused on helping programmers with this daunting task. A small Pittsburgh company, SureLogic, is trying to address this tool shortage with a suite of Java-based concurrency-focused static and dynamic analysis tools. SureLogic has extensive experience in Java concurrency and in analysis of large and complex Java code bases.

SureLogic JSure is a model-based static analysis tool that helps developers gain confidence in their multi-threaded Java code, regardless of scale or complexity. JSure provides positive assurance (sound analysis, not rule-based) that correct locks are held when shared state is accessed. This tool helps the programmer answer the question "Are my threads accessing shared state in a safe way?" The SureLogic JSure modeling language is open source under the Apache license. Even when the JSure tool is not used, developers have found the annotation-based models very useful for documentation purposes.

SureLogic Flashlight is a dynamic analysis tool that acts as a concurrency-focused runtime profiler that illuminates threading behavior and access to shared state. When developers are in the dark about why their application is experiencing intermittent failures, poor performance, or data corruption, Flashlight provides visibility. SureLogic is working with Carnegie Mellon University to enhance Flashlight to bridge from development into distributed monitoring.

Four SureLogic engineers visited Yahoo! Sunnyvale on October 28, 29, and 30 and worked with members of the MapReduce, ZooKeeper, and HDFS teams. The SureLogic engineers were Tim Halloran, Aaron Greenhouse, Edwin Chan, and Nathan Boy. On the Yahoo! side, Konstantin (Cos) Boudnik hosted the SureLogic team. Cos works on HDFS. The other HDFS engineers who participated were Konstantin Shvachko, Hairong Kuang, Jakob Homan, and Boris Shkolnik. The ZooKeeper engineers who participated were Pat Hunt, Mahadev Konar, and Ben Reed. The MapReduce engineers who participated were Dick King, Chris Douglas, Owen O'Malley, and Hong Tang. Nigel Daley also participated.

SureLogicGuys.jpg
Above: SureLogic engineers Edwin Chan, Aaron Greenhouse, and Nathan Boy in front of Building E in Sunnyvale.

During the visit, Yahoo! and SureLogic engineers worked side-by-side in several conference rooms. JSure and Flashlight were run on the Hadoop code, with the SureLogic team providing expertise and instruction on the JSure and Flashlight tools and the Yahoo! engineers providing a deep understanding of the code they work on and the environment it is developed within. Typically a projector was used to allow everyone in the room to see the tool results.

Flashlight

To work with Flashlight, programmers run an instrumented version of their program that Flashlight creates automatically. The data collected from the program run can be queried in a very general way. Flashlight currently supports 47 concurrency-focused queries as well as custom queries created by users. These range from informational queries such as "What fields were observed to be shared between threads?" and "Where were two or more locks held at the same time?" to focused queries that uncover race conditions and potential deadlocks, such as "What locks could potentially deadlock?" and "What shared fields are not protected by a consistent lock?" The Flashlight tool, which is hosted within the Eclipse Java IDE, is shown in the screenshot below.

SureLogic1.png
Above: Sample Flashlight tool output on a run of HDFS TestFileAppend2. The query shows non-final non-static fields that were observed to be shared between one or more threads during the run. A lock contention query (highly contested locks can impact program performance) and a potential deadlock query have been run on the data.

The Yahoo! engineers thought that Flashlight was "cool/powerful/useful" but also provided a list of ideas to improve the tool and make it more useful on Hadoop. In particular the ability to focus on "what changed" from one run to another (e.g., deadlocks or race conditions) would help make the tool more useful. SureLogic is working on these improvements to Flashlight.

JSure

To work with JSure, programmers place a few Java 5 annotations (or Javadoc tags for 1.4 code) in their code, which the tool then checks. These annotations represent programmer design intent about the program that cannot be readily inferred from the program's code. JSure checks that the annotations are consistent with the code. A few sample JSure annotations are shown below.

@Regions({
   @Region("HeartbeatState"),
   @Region("DatanodeState")
})
@RegionLocks({
   @RegionLock("HeartbeatLock is heartbeats protects HeartbeatState"),
   @RegionLock("DatanodeLock is datanodeMap protects DatanodeState")
})
public class FSNamesystem implements FSConstants, FSNamesystemMBean,
FSClusterStats { ... }

These annotations define two regions of the HDFS program's state, HeartbeatState and DatanodeState, and two locking models, HeartbeatLock and DatanodeLock. Fields are added to a declared region using annotations at the field declarations. For example,

@InRegion("HeartbeatState") private int totalLoad = 0;

adds the field totalLoad to the region HeartbeatState.

The @RegionLock annotation defines a lock that is intended to protect a region of the program's state. For example,

@RegionLock("DatanodeLock is datanodeMap protects DatanodeState")

states that a lock on the object referenced by the field datanodeMap protects the program's state defined by the DatanodeState region. JSure checks that the annotations, which define a model of programmer design intent, are consistent with the code. An example of the tool output for the DatanodeLock model is shown in the screenshot below.

SureLogic2.png
Above: Sample JSure tool output for the DatanodeLock model annotated in FSNamesystem (part of HDFS). The green results indicate areas of model-code consistency; the red results indicate inconsistency. The goal is to keep the code and the annotated model consistent.
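
The locking discipline that a model such as "DatanodeLock is datanodeMap protects DatanodeState" describes can be sketched in plain Java without the JSure annotations, which need SureLogic's library on the classpath. The class below is a hypothetical stand-in, not code from FSNamesystem: the monitor of the map object is the lock, and every access to the guarded state holds it.

```java
import java.util.HashMap;
import java.util.Map;

class DatanodeRegistrySketch {
    // Plays the role of datanodeMap: the object whose monitor is the lock.
    private final Map<String, Long> datanodeMap = new HashMap<>();

    // Hypothetical field standing in for state in the DatanodeState region.
    private int registrations = 0;

    // Every read or write of the guarded state holds the lock on datanodeMap.
    void register(String nodeId, long timestamp) {
        synchronized (datanodeMap) {
            datanodeMap.put(nodeId, timestamp);
            registrations++;
        }
    }

    int registrations() {
        synchronized (datanodeMap) {
            return registrations;
        }
    }

    // Runs several writer threads and returns the final registration count.
    static int demo(int threads, int perThread) {
        DatanodeRegistrySketch registry = new DatanodeRegistrySketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    registry.register("node-" + id + "-" + i, System.nanoTime());
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return registry.registrations();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000)); // prints 4000
    }
}
```

If one method touched registrations without the synchronized block, the count could come up short under contention. That kind of inconsistency between the stated model and the code is what the article says JSure reports.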

JSure supports 22 kinds of annotations that help a programmer precisely document concurrency design intent about his or her code. These annotations are useful even without the tool and are described at http://www.surelogic.com/promises/apidocs/index.html

The Yahoo! engineers liked the annotations and their ability to formally express concurrency design intent. They had suggestions to improve the tool experience and make it more useful to them. SureLogic is working on these improvements to JSure. In particular the ZooKeeper team found JSure to be extremely impressive. "We found a number of significant issues with just a few hours of work. We really like the iterative approach. We really like the start with nothing approach (We hate tools that spew thousands of problems that are not actionable). We like the idea that JSure can be integrated into our build and run as part of the patch process." They also noted that, "The annotations need to be standardized." SureLogic has open sourced the JSure annotations and is involved with the JSR process.

Since the SureLogic visit to Sunnyvale, SureLogic has released the JSure annotations into the Maven global repository and JIRA requests are pending to introduce the annotations into the Hadoop build process. Cos and Edwin Chan are working on including JSure into the patch system used by Hadoop (similar to FindBugs).

SureLogic is supporting the Hadoop community by providing licenses and support for their tools and has open sourced the JSure annotations. The tools can be downloaded from http://www.surelogic.com/static/eclipse/install.html and Hadoop community members can read the post at http://wiki.apache.org/hadoop/HowToUseConcurrencyAnalysisTools about the tools and obtain a license from there. The SureLogic tools have several tutorials that should get you up and running quickly. You can also get help directly from SureLogic if you have any problems. SureLogic has a public Bugzilla at http://www.surelogic.com/bugzilla/index.cgi for tracking user issues (focused on the annotations and their effective use as well as the tools).

A great presentation, "Racing Toward Disaster", which discusses concurrency traps and pitfalls and was given by Tim during the SureLogic visit to Sunnyvale, can be found at http://www.scribd.com/doc/25448092

SureLogic would like to thank Cos for hosting us and Don McGillen for his help in getting us to Sunnyvale. Thanks!


Hadoop Bay Area January 2010 User Group - Recap

Hi Hadoopers

Thanks everyone for joining us last night at Yahoo!’s Sunnyvale campus. There were close to 150 attendees, a nice way to start the meetings for 2010. I was happy to see familiar and many new faces. It was also great to see the thriving conversations and solution sharing.

For those of you who were unable to attend in person, the session's details, slides and video recordings are posted below.

Bhupesh Bansal, Senior Engineer at LinkedIn, shared the details behind Project-Voldemort (a distributed key-value storage system based on the Amazon Dynamo project): challenges, performance, features and more. Bhupesh reviewed the growing use of Hadoop for batch computing at LinkedIn - data store, workflows, ETL, prototyping and more.
Bhupesh is a member of LinkedIn's search and data platform team and an active committer for Project-Voldemort.

 


Cloudera Hadoop Blog

Cloudera speaks VMware vCloud API, too.

We’ve announced, with VMware, the ability to use third-party vCloud Express service providers and the vCloud API to run Cloudera’s Distribution for Hadoop. We think this is interesting; as cloud services proliferate, it’s important to be able to move easily among public and private clouds. vCloud makes that easier and VMware is working hard to [...]

Hadoop World: Building Data Intensive Apps with Hadoop and EC2

Today’s Hadoop World Talk comes from Pete Skomoroch, and dives into detail about how he built TrendingTopics.org using Hadoop and EC2.

Hadoop World: Making Hadoop Easy on Amazon Web Services

Today’s Hadoop World talk comes from Peter Sirota, who leads Amazon Web Services’ Elastic MapReduce team. In this talk, Peter provides more detail on the platform, shares some new features, and shows how the AWS community, from customers to developers, is making things easier with Hadoop.

Hadoop World: Hadoop Applications at Yahoo!

Today’s Hadoop World talk comes from Eric Baldeschwieler, Yahoo!’s VP of Hadoop Development. In this talk, Eric highlights Yahoo!’s contributions to development and testing of Hadoop at scale, and goes into detail about how Yahoo! uses Hadoop to deliver several popular services. A major thanks to Eric, and everyone else at Yahoo! for their [...]

7 Tips for Improving MapReduce Performance

One service that Cloudera provides for our customers is help with tuning and optimizing MapReduce jobs. Since MapReduce and HDFS are complex distributed systems that run arbitrary user code, there’s no hard and fast set of rules to achieve optimal performance; instead, I tend to think of tuning a cluster or job much like a [...]
 
 

 

© 2010   Created by Jason Venner on Ning.
