Here is your pdf: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

The length of the document below is: 14 page(s) long

The self-declared author(s) is/are:

www.usenix.org

The subject is as follows:
Subject: Original authors did not specify.

The original URL is: LINK

The access date was:
Access date: 2019-04-01 10:24:55.760295


The content is as follows:

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica

University of California, Berkeley

Abstract

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

1 Introduction

Cluster computing frameworks like MapReduce [10] and Dryad [19] have been widely adopted for large-scale data analytics. These systems let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance.

Although current frameworks provide numerous abstractions for accessing a cluster’s computational resources, they lack abstractions for leveraging distributed memory. This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations. Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. Another compelling use case is interactive data mining, where a user runs multiple ad-hoc queries on the same subset of the data. Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.

Recognizing this problem, researchers have developed specialized frameworks for some applications that require data reuse. For example, Pregel [22] is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop [7] offers an iterative MapReduce interface. However, these frameworks only support specific computation patterns (e.g., looping a series of MapReduce steps), and perform data sharing implicitly for these patterns. They do not provide abstractions for more general reuse, e.g., to let a user load several datasets into memory and run ad-hoc queries across them.

In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
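To make the idea of explicitly persisting an intermediate result concrete, the following is a minimal single-process sketch in plain Python. It is not Spark's API: the class `ToyDataset`, its methods, the `compute_calls` counter, and the sample log lines are all hypothetical names introduced here for illustration. The sketch only shows that a persisted intermediate dataset is built once and then reused by later queries.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for an RDD-like dataset (not Spark's API)."""

    def __init__(self, compute: Callable[[], List[str]]):
        self._compute = compute                  # how to (re)build the data
        self._cached: Optional[List[str]] = None
        self._persist = False
        self.compute_calls = 0                   # how often the data was rebuilt

    def persist(self) -> "ToyDataset":
        # Request that the result be kept in memory once it has been computed.
        self._persist = True
        return self

    def filter(self, pred: Callable[[str], bool]) -> "ToyDataset":
        # Derive a new dataset; nothing is computed until collect() is called.
        return ToyDataset(lambda: [x for x in self.collect() if pred(x)])

    def collect(self) -> List[str]:
        if self._cached is not None:
            return self._cached                  # reuse the in-memory copy
        self.compute_calls += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return data


# Made-up input standing in for data held in stable storage.
LOG_LINES = ["ERROR disk full", "INFO started", "ERROR timeout", "WARN slow"]

errors = (
    ToyDataset(lambda: list(LOG_LINES))
    .filter(lambda line: line.startswith("ERROR"))
    .persist()
)

# Two ad-hoc queries over the same persisted intermediate result.
disk_errors = errors.filter(lambda line: "disk" in line).collect()
timeout_errors = errors.filter(lambda line: "timeout" in line).collect()
```

Without the call to `persist()`, each query in this sketch would rebuild `errors` from the source, which mirrors the cost of re-reading data between jobs in frameworks that lack an in-memory sharing abstraction.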

The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.

In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.¹ If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute

¹ Checkpointing the data in some RDDs may be useful when a lineage chain grows large, however, and we discuss how to do it in §
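The lineage-based recovery described in the introduction can be sketched as a toy model in plain Python. This is an illustration under stated assumptions, not Spark's implementation: the class `LineageDataset`, its methods, and the `STABLE_STORAGE` list are hypothetical, and only per-partition transformations (map and filter) are modeled. A join would need dependencies across partitions, which this sketch leaves out.

```python
from typing import Callable, List, Optional


class LineageDataset:
    """Hypothetical toy model of lineage-based recovery (not Spark's code)."""

    def __init__(self, num_partitions: int,
                 parent: Optional["LineageDataset"] = None,
                 transform: Optional[Callable[[list], list]] = None,
                 source: Optional[Callable[[int], list]] = None):
        self.parent = parent          # dataset this one was derived from
        self.transform = transform    # logged coarse-grained operation
        self.source = source          # base datasets only: partition index -> items
        self.partitions: List[Optional[list]] = [None] * num_partitions

    def map(self, fn: Callable) -> "LineageDataset":
        return LineageDataset(len(self.partitions), parent=self,
                              transform=lambda part: [fn(x) for x in part])

    def filter(self, pred: Callable) -> "LineageDataset":
        return LineageDataset(len(self.partitions), parent=self,
                              transform=lambda part: [x for x in part if pred(x)])

    def get_partition(self, i: int) -> list:
        if self.partitions[i] is None:
            # Rebuild only partition i by replaying the logged transformation.
            if self.parent is None:
                self.partitions[i] = list(self.source(i))
            else:
                self.partitions[i] = self.transform(self.parent.get_partition(i))
        return self.partitions[i]

    def collect(self) -> list:
        return [x for i in range(len(self.partitions))
                for x in self.get_partition(i)]

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one in-memory partition.
        self.partitions[i] = None


# Made-up input standing in for three partitions held in stable storage.
STABLE_STORAGE = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

base = LineageDataset(3, source=lambda i: STABLE_STORAGE[i])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

before = evens_squared.collect()
evens_squared.lose_partition(1)      # partition 1 is lost
after = evens_squared.collect()      # rebuilt from lineage, not from a replica
```

In this sketch, what each derived dataset records is the function that produced it, so recovery needs no copy of the derived data itself. Only the lost partition is recomputed, while the surviving partitions are served from memory.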
