Data Locality in Hadoop Explained

Updated: 2017-10-25 10:54:44 Author: csguo007



Data locality in Hadoop refers to the "proximity" of the data with respect to the Mapper tasks working on that data.

1. Why is data locality important?

When a dataset is stored in HDFS, it is divided into blocks, and the blocks are stored on the DataNodes of the Hadoop cluster. When a MapReduce job runs against the dataset, the individual Mappers process these blocks (as input splits). If a Mapper cannot read its data from the node it is running on, the data has to be copied over the network from the DataNode that holds it to the node that is executing the Mapper task.

Suppose a MapReduce job has more than 1000 Mappers, and every one of them tries to copy its data from another DataNode in the cluster at the same time. Because all the copies happen at once, the result is serious network congestion, which is far from ideal. Moving the computation closer to the data is therefore almost always more effective and cheaper than moving the data closer to the computation.
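To put a rough number on the congestion argument above, the following sketch estimates how much data crosses the network for a given share of local Mappers. The block size of 128 MiB per Mapper and the 10 Gbit/s link are assumptions chosen for illustration, and the function names are hypothetical; none of them come from the article or from a Hadoop API.

```python
# Back-of-the-envelope estimate of the network traffic caused by
# Mappers that cannot read their block from the local disk.
# Assumptions (illustrative only): one 128 MiB block per Mapper and a
# single shared 10 Gbit/s link.

MIB = 1024 * 1024


def network_bytes(num_mappers: int, local_fraction: float,
                  block_bytes: int = 128 * MIB) -> int:
    """Bytes that must be copied over the network when only
    `local_fraction` of the Mappers run on a node holding their block."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    remote_mappers = num_mappers - round(num_mappers * local_fraction)
    return remote_mappers * block_bytes


def transfer_seconds(total_bytes: int,
                     link_bits_per_second: float = 10e9) -> float:
    """Idealized time to push `total_bytes` through one shared link."""
    return total_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 0.9, 1.0):
        moved = network_bytes(1000, fraction)
        print(f"{fraction:.0%} local: {moved / 1024**3:.1f} GiB moved, "
              f"about {transfer_seconds(moved):.0f} s on the shared link")
```

Under these assumptions, 1000 fully remote Mappers move 1000 × 128 MiB = 125 GiB, while a fully local job moves nothing. A real cluster has many links and overlapping transfers, so the timing is only an order-of-magnitude guide.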

2. How is data proximity defined?

When the JobTracker (MRv1) or the ApplicationMaster (MRv2) receives a request to run a job, it checks which nodes in the cluster have enough resources to execute the job's Mappers and Reducers. At the same time, it gives serious consideration to where each Mapper's data is located when deciding which node that Mapper will run on.
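The placement decision described above can be sketched as a simple preference order: a free node that holds the data, then a free node on the same rack, then any free node. This is a simplified illustration and not the actual JobTracker or ApplicationMaster logic; `choose_node`, `rack_of`, and the node names are hypothetical, and the sketch assumes every node appears in the `rack_of` mapping.

```python
from typing import Dict, Iterable, Optional


def choose_node(replica_nodes: Iterable[str],
                free_nodes: Iterable[str],
                rack_of: Dict[str, str]) -> Optional[str]:
    """Pick a node for one Mapper, preferring nodes close to its data.

    replica_nodes: nodes that store a copy of the Mapper's block
    free_nodes:    nodes that currently have resources for a Mapper
    rack_of:       node name -> rack name (assumed to cover all nodes)
    """
    replicas = set(replica_nodes)
    candidates = sorted(free_nodes)  # sorted only to make the result deterministic
    replica_racks = {rack_of[node] for node in replicas}

    # First choice: a free node that already holds the data.
    for node in candidates:
        if node in replicas:
            return node

    # Second choice: a free node on the same rack as a copy of the data.
    for node in candidates:
        if rack_of[node] in replica_racks:
            return node

    # Last resort: any free node, even on a different rack.
    return candidates[0] if candidates else None


if __name__ == "__main__":
    racks = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}
    print(choose_node(["n1"], ["n1", "n3"], racks))
    print(choose_node(["n1"], ["n2", "n3"], racks))
    print(choose_node(["n1"], ["n3", "n4"], racks))
```

A production scheduler weighs more than proximity (queue capacity, fairness, waiting time), so this only captures the ordering of preferences.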

3. Data Local

When the data resides on the same node that executes the Mapper, we call it Data Local. In this case the data is as close to the computation as it can be. The JobTracker (MRv1) or ApplicationMaster (MRv2) prefers to run a Mapper on a node that holds the data the Mapper needs.

4. Rack Local

Although Data Local is the ideal choice, resource constraints on a busy cluster mean it is not always possible to execute the Mapper on the same node as the data. In such cases the preferred option is to run the Mapper on a different node that sits on the same rack as the node holding the data. The data then moves between nodes, from the node that has it to the node on the same rack that executes the Mapper. We call this Rack Local.

5. Different Rack

On a busy cluster, sometimes even Rack Local is not possible. In that case a node on a different rack is chosen to execute the Mapper, and the data is copied from the node that holds it to the node on the other rack that runs the Mapper. This is the least desirable case.
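The three cases above (Data Local, Rack Local, Different Rack) can be expressed as a small classifier that labels where a Mapper ended up relative to its data. This is a minimal sketch with hypothetical names (`classify_locality`, `summarize`, `rack_of`), assuming a known node-to-rack mapping; it is not taken from Hadoop's code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

DATA_LOCAL = "data-local"
RACK_LOCAL = "rack-local"
DIFFERENT_RACK = "different-rack"


def classify_locality(mapper_node: str,
                      replica_nodes: Iterable[str],
                      rack_of: Dict[str, str]) -> str:
    """Label a placed Mapper by how close it runs to a copy of its data."""
    replicas = set(replica_nodes)
    if mapper_node in replicas:
        return DATA_LOCAL          # same node: no network copy needed
    if rack_of[mapper_node] in {rack_of[node] for node in replicas}:
        return RACK_LOCAL          # same rack: copy stays inside the rack
    return DIFFERENT_RACK          # copy has to cross racks


def summarize(placements: Iterable[Tuple[str, Iterable[str]]],
              rack_of: Dict[str, str]) -> Counter:
    """Count how many Mappers fall into each locality level.

    placements: (mapper_node, replica_nodes) pairs, one per Mapper
    """
    return Counter(classify_locality(node, replicas, rack_of)
                   for node, replicas in placements)


if __name__ == "__main__":
    racks = {"n1": "rack1", "n2": "rack1", "n3": "rack2"}
    job = [("n1", ["n1"]), ("n2", ["n1"]), ("n3", ["n1"]), ("n3", ["n3"])]
    for level, count in summarize(job, racks).items():
        print(level, count)
```

A tally like this makes it easy to see how much of a job ran in the least desirable category; the higher the data-local share, the less data had to be copied.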
