Dr. Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system, and he has been contributing to and working with Hadoop since version 0.1.0. He started the Yahoo! Grid solutions team, which focused on training, consulting for, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years. He has worked at the Center for Development of Advanced Computing (C-DAC), the National Center for Supercomputing Applications (NCSA), the Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), and Yahoo!. Currently, he works on distributed data systems at LinkedIn Corp.
Parallel programmer, data-intensive supercomputing.
November 10 11:30AM
Apache Hadoop makes it extremely easy to develop parallel programs based on the MapReduce programming paradigm by taking care of work decomposition, distribution, assignment, communication, monitoring, and the handling of intermittent failures. However, developing Hadoop applications that scale linearly to hundreds, or even thousands, of nodes requires an extensive understanding of Hadoop's architecture and internals, as well as of its hundreds of tunable configuration parameters. In this talk, I will illustrate common techniques for building scalable Hadoop applications, along with pitfalls to avoid. I will explain the seven major causes of sublinear scalability of parallel programs in the context of Hadoop, with real-world examples based on my experiences with hundreds of production applications at Yahoo! and elsewhere. I will conclude with a scalability checklist for Hadoop applications and a methodical approach to identifying and eliminating scalability bottlenecks.