
ACM profile: Jeff Dean, the Google scientist behind MapReduce and BigTable

Today's Topic: People of ACM: Jeff Dean

Thursday, May 16, 2013

Jeff Dean is a Google Fellow in the Systems Infrastructure Group. Before joining Google, he was at Digital Equipment Corporation's Western Research Laboratories. In 1990 and 1991, Dean worked for the World Health Organization's Global Programme on AIDS, developing software for statistical modeling, forecasting, and analysis of the HIV pandemic. A Fellow of ACM, he is a co-recipient (with Sanjay Ghemawat) of the 2012 ACM-Infosys Foundation Award in the Computing Sciences. He was elected to the National Academy of Engineering in 2009. A summa cum laude graduate of the University of Minnesota with an M.S. degree in Computer Science, he received a Ph.D. degree in Computer Science from the University of Washington.

Research areas include large-scale distributed systems, performance monitoring, compression techniques, information retrieval, application of machine learning to search and other related problems, microprocessor architecture, compiler optimizations, and development of new products that organize existing information in new and interesting ways. Products Jeff has developed for Google include AdSense, MapReduce, BigTable, and Google Translate.

From your perspective as a pioneering software engineer, what do you see as the next big thing in Internet-scale computing?

There's a confluence of factors that I believe will enable very different kinds of applications. First, large-scale machine learning is making significant improvements for perceptual tasks, such as speech recognition, computer vision, and natural language understanding. Second, mobile devices are becoming more powerful and are more likely to have highly available, high-bandwidth, low-latency connections to the Internet. Taken together, this means that mobile devices can take advantage of large-scale processing capabilities in datacenters to understand more about the user's environment, and offer them assistance and additional information based on their current context at (hopefully) the precise moment that the user wants the information.

What impact did your research on scalable infrastructure have on the advent of cloud computing?

Much of our work on scalable infrastructure arose out of necessity in building and operating Google's crawling, indexing, and query serving systems. We were dealing with large datasets, and the most cost-effective way to get enough computational power was to buy lots and lots of relatively inexpensive, "commodity-class" computers, and to build software abstraction layers that make reliable systems out of these somewhat unreliable and modestly sized individual machines. The systems that we have built solved significant problems we were encountering along the way. Several of my colleagues built a large-scale distributed file system (Google File System, or GFS), Sanjay Ghemawat and I developed an abstraction called MapReduce for doing reliable computations across thousands of machines, and later many other colleagues and I worked on higher-level storage systems such as BigTable and Spanner.
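The MapReduce abstraction Dean describes can be sketched in a few lines of plain Python. This is a toy, single-machine illustration of the programming model (map emits key-value pairs, the framework shuffles them by key, reduce combines each group), not Google's distributed implementation:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would
    (in a real cluster this step moves data between machines)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values -- here, sum the word counts."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"], counts["fox"])  # 3 2
```

The appeal of the model is that the user writes only the map and reduce functions; partitioning, scheduling, and fault tolerance are the framework's job.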

Our work has had some impact because other organizations, groups, and even other academic disciplines face many of the same problems that we faced when trying to store and process large datasets, and our approaches have turned out to be useful in these other environments.

Do you regard the open publication of your work as an opportunity for indirect technology transfer, and if so, how has it influenced the way other software engineers design and program large-scale IT systems?

Certainly. In some cases we can release open source versions of the software systems we've built, but sometimes we aren't able to easily open-source entire systems. Publishing papers that describe such systems in enough detail that others can use the general principles to create open source versions is one effective means of technology transfer. Publications can also sometimes get others to take similar strategies to tackle problems in other domains.

What advice would you give to budding technologists in the era of big data who are considering careers in computing?

Large datasets and computational challenges are cropping up in more and more domains. For example, the current modest–sized stream of genetic data is going to turn into an incredible flood because the costs of DNA sequencing are dropping exponentially. Computer systems and algorithms that can find interesting correlations in such data are going to really make important scientific discoveries about the genetic basis of diseases, and will make medicine much better targeted and personalized. That's just one example. Having a strong computer science background and being able to apply it to solve important computational problems is an incredibly exciting way for computer scientists to have tremendous impact on the world.



Hadoop Ecosystem vs. BDAS stack

After reading the introduction to BDAS below, do you think BDAS will replace the Hadoop/MapReduce ecosystem, or will the two coexist ( )? If they coexist, under what circumstances should one choose a Hadoop-based platform, and under what circumstances the BDAS stack? Keep in mind that both are open-source platform architectures, and both may even end up endorsed by the Apache Foundation:

The Hadoop ecosystem has been our main reference when building systems for big data analytics. Recently I noticed that the Berkeley Data Analytics Stack (BDAS), proposed by UC Berkeley's AMP Lab ( ), is gathering tremendous momentum: besides interoperating with the existing Hadoop ecosystem, its developers also claim it outperforms the existing Hadoop ecosystem. From the slides we can see that well-known American universities and IT companies sponsor this research project. The main goal of the BDAS architecture is to provide a single integrated environment for three kinds of big data analytics: interactive queries, streaming analytics, and batch analytics. In the slides, Ion uses system-security analysis to illustrate what each of these three modes means. BDAS modules have been released one after another, with APIs open to big data analysts worldwide; by calling these APIs, the code we need to write for big data analysis can be made far more concise.

Professor Ion Stoica's slides introducing BDAS at Strata 2013 (one of the best-known conferences on big data analytics):

URLs and materials (including videos and slides) from the AMP Lab's four past training camps:

Among these, SparkR (Spark + R) is roughly analogous to our Hadoop/MapReduce + R combination. When they get a chance, could 文友 and 宗哲 look over its setup documentation and evaluate whether we should adopt this environment?

GraphLab/GraphX and RDF(S) graph

I have already studied in depth how GraphLab and GraphX work, mainly in order to respond to the National Science Council reviewers' comments on our grant proposal. You can refer to these key URLs:
GraphX Programming:

In short, what the reviewer cited was GraphLab. Its graph-parallel processing is fast because it takes a natural graph (unlike the property graphs of RDF(S)/FOAF), partitions it by cutting at nodes, and distributes the pieces across multiple machines. For purely graph-parallel computations such as PageRank, even though information is exchanged during the computation, it is still much faster than traditional MapReduce: in the graph-parallel setting the graph is partitioned in advance and loaded into memory for computation, so there is little disk I/O. If I remember correctly, GraphLab uses a shared-memory multi-machine architecture, so data exchange among the parallel computations is much faster than on an ordinary cluster with a distributed-memory architecture. For data-parallel processing, however, GraphLab is of little help. Data-parallel work involves repeated round trips of disk I/O, for example consolidating data from multiple graph sources to produce a new graph, or the ELT (Extract, Load, Transform) steps that transform, clean, and reload existing graphs. And this is exactly the scenario that a complete big data analytics pipeline (or graph data analytics pipeline) runs into.
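The graph-parallel style described here (partition the graph, hold it in memory, iterate with only message exchange and no disk I/O) can be illustrated with a minimal PageRank in plain Python. This is a single-machine sketch of the vertex-centric iteration that GraphLab parallelizes, not GraphLab's actual API, and it ignores dangling vertices for simplicity:

```python
def pagerank(edges, num_iters=20, damping=0.85):
    """Iterative PageRank over an in-memory adjacency list.
    Each iteration, every vertex pushes its rank along its out-edges;
    this per-vertex step is what a graph-parallel engine distributes
    by cutting the graph across machines."""
    out = {}
    vertices = set()
    for src, dst in edges:
        out.setdefault(src, []).append(dst)
        vertices.update((src, dst))
    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(num_iters):
        contrib = {v: 0.0 for v in vertices}
        for src, neighbors in out.items():
            share = rank[src] / len(neighbors)
            for dst in neighbors:
                contrib[dst] += share
        rank = {v: (1 - damping) / len(vertices) + damping * c
                for v, c in contrib.items()}
    return rank

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges)
```

The whole working set is the in-memory `rank` and `contrib` dictionaries; nothing touches disk between iterations, which is the key contrast with a MapReduce job that re-reads its input every pass.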

UC Berkeley AMP Lab's GraphX was created to address the problems above. In fact GraphX is just one module within the BDAS architecture, built specifically for processing and analyzing large graph data. GraphX can solve the graph-parallel and the data-parallel problems in one integrated system.

As for RDF(S) property graphs, the nodes and edges of a single graph can each carry multiple properties. Although actual RDF(S) triples break such a graph apart into separate node-edge-node triples, processing them is different from traversing a plain natural graph through iterative processing.
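The distinction can be made concrete with a tiny example. Below, a FOAF-style property graph (nodes carrying several properties each) is flattened into RDF-style (subject, predicate, object) triples; the node identifiers and property names are made up for illustration. One node fans out into several triples, which is why triple-pattern matching traverses the data differently from walking a plain natural graph:

```python
# A tiny FOAF-style property graph: nodes carry multiple properties.
nodes = {
    "alice": {"name": "Alice", "age": 30},
    "bob":   {"name": "Bob"},
}
edges = [("alice", "knows", "bob")]

def to_triples(nodes, edges):
    """Flatten the property graph into RDF-style (s, p, o) triples:
    every edge is one triple, and every node property becomes its
    own additional triple."""
    triples = [(s, p, o) for s, p, o in edges]
    for node_id, props in nodes.items():
        for prop, value in props.items():
            triples.append((node_id, prop, value))
    return triples

triples = to_triples(nodes, edges)
```

After flattening, the structural edge `("alice", "knows", "bob")` and the property triple `("alice", "age", 30)` look identical to a triple store, even though only the former is a traversable edge of the underlying natural graph.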

Microsoft Research in Beijing recently published a paper on Trinity.RDF, built on top of their Trinity graph engine to process RDF(S) graphs. Their approach extracts a natural-graph structure from the RDF(S) graph and then applies existing natural-graph processing techniques to it; they claim this is much faster than the subgraph triple-matching commonly used to answer SPARQL queries. In fact, GraphX in BDAS also provides a module for handling RDF(S)-like property graphs: the mrTriplets operation from the GraphX programming guide mentioned above.
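The idea behind GraphX's mrTriplets operator can be sketched in plain Python. This is only an illustration of the pattern, not the GraphX API: a map function sees each edge "triplet" (source attribute, edge attribute, destination attribute) and emits messages to vertices, and a reduce function combines messages that arrive at the same vertex:

```python
def mr_triplets(vertices, edges, map_fn, reduce_fn):
    """A plain-Python sketch of a mapReduceTriplets-style operator:
    map_fn sees each (src, src_attr, edge_attr, dst, dst_attr) triplet
    and emits (vertex_id, message) pairs; reduce_fn combines the
    messages arriving at the same vertex."""
    inbox = {}
    for src, attr, dst in edges:
        for vid, msg in map_fn(src, vertices[src], attr, dst, vertices[dst]):
            inbox[vid] = msg if vid not in inbox else reduce_fn(inbox[vid], msg)
    return inbox

vertices = {"a": 1, "b": 2, "c": 3}   # vertex attribute: a weight
edges = [("a", "e1", "b"), ("c", "e2", "b"), ("a", "e3", "c")]

# Sum each vertex's in-neighbors' weights: the map sends the source's
# weight to the destination, the reduce adds the incoming messages.
result = mr_triplets(
    vertices, edges,
    map_fn=lambda s, sa, ea, d, da: [(d, sa)],
    reduce_fn=lambda x, y: x + y,
)
```

Many graph algorithms (in-degree, PageRank contributions, neighborhood aggregates) reduce to choosing an appropriate map and reduce pair over triplets, which is what lets one engine serve both graph-parallel and data-parallel work.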



R software information

Note also that parallelization in R is not quite the same concept as parallelizing a graph by cutting it at vertices (or nodes). R programs parallelize through lapply and the other apply() family functions; SparkR shows how it partitions a dataset and processes the pieces in parallel ( ), and I believe RHadoop follows a similar parallel-R approach. Graph parallelization is finer-grained, and it cuts the existing large graph itself at vertices or edges. When the graph's structure follows a power-law distribution, identifying the important hub vertices at which to cut makes the computational load much more evenly balanced ( ).
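The lapply-style data parallelism described here amounts to chunking a dataset and applying a function to every element of each chunk independently. A minimal sketch in Python (a hypothetical helper using threads on one machine, where SparkR would ship each chunk to a different cluster node):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_apply(data, fn, num_chunks=4):
    """Data-parallel apply in the spirit of R's lapply: split the
    dataset into coarse chunks, apply fn to every element of each
    chunk in parallel, then concatenate the per-chunk results."""
    size = max(1, len(data) // num_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda chunk: [fn(x) for x in chunk], chunks)
    return [y for part in parts for y in part]

squares = parallel_apply(list(range(10)), lambda x: x * x)
```

The granularity here is a whole data chunk, with no communication between chunks; graph parallelism, by contrast, must exchange messages along the edges that a vertex or edge cut severs, which is why the partitioning strategy matters so much more there.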


Regarding the conceptual architecture of the R software system: from a computer-science perspective, what interests us most is the design philosophy behind R's software modules and the system mechanisms that make them work. Start with the short article "How R Searches and Finds Stuff": , then download the roughly 500-page reference book Software for Data Analysis: Programming with R. Its author, John M. Chambers, is one of the original creators of S, the language from which R descends, so no one is better placed to describe R's software architecture. The electronic copy, published by Springer in 2008, is attached.

In addition, for working with R you can use Revolution Analytics' enhanced distribution, which is completely free for individual use by academic faculty and students:


Powered by Drupal 5.5 and copyright © 新趨勢網路科技實驗室 ( Emerging Network Technology Laboratory ), Some Rights Reserved
This work is licensed under a Creative Commons License.