Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley
Abstract
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
1 Introduction
Cluster computing frameworks like MapReduce [10] and Dryad [19] have been widely adopted for large-scale data analytics. These systems let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance.
Although current frameworks provide numerous abstractions for accessing a cluster’s computational resources, they lack abstractions for leveraging distributed memory. This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations. Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. Another compelling use case is interactive data mining, where a user runs multiple ad-hoc queries on the same subset of the data. Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.
Recognizing this problem, researchers have developed specialized frameworks for some applications that require data reuse. For example, Pregel [22] is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop [7] offers an iterative MapReduce interface. However, these frameworks only support specific computation patterns (e.g., looping a series of MapReduce steps), and perform data sharing implicitly for these patterns. They do not provide abstractions for more general reuse, e.g., to let a user load several datasets into memory and run ad-hoc queries across them.
In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
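To make the idea of explicitly persisting an intermediate result concrete, here is a minimal single-process sketch. It is not Spark's API: `ToyDataset`, `persist`, `collect`, and `read_log` are hypothetical names chosen for illustration, and the "stable storage" read is simulated by a plain function. The sketch only shows that two ad-hoc queries over a persisted dataset trigger one load instead of two.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset that can be pinned in memory."""

    def __init__(self, loader):
        self._loader = loader   # function simulating a read from stable storage
        self._cached = None     # in-memory copy, filled on first use after persist()
        self._persist = False
        self.loads = 0          # how many times the expensive loader ran

    def persist(self):
        """Ask for the data to be kept in memory once it has been loaded."""
        self._persist = True
        return self

    def collect(self):
        """Return the data, reusing the in-memory copy when one exists."""
        if self._persist and self._cached is not None:
            return self._cached
        self.loads += 1
        data = list(self._loader())
        if self._persist:
            self._cached = data
        return data


def read_log():
    # Assumed sample data; a real job would read from a distributed file system.
    return ["ERROR disk full", "INFO started", "ERROR timeout", "INFO stopped"]


plain = ToyDataset(read_log)             # reloaded for every query
cached = ToyDataset(read_log).persist()  # loaded once, then reused

for ds in (plain, cached):
    # Two ad-hoc queries over the same data.
    errors = [line for line in ds.collect() if line.startswith("ERROR")]
    timeouts = [line for line in ds.collect() if "timeout" in line]

print(plain.loads, cached.loads)
```

In this toy, the non-persisted dataset runs its loader once per query, while the persisted one runs it once in total; the gap grows with the number of queries or iterations.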
The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.
In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.¹ If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute
¹ Checkpointing the data in some RDDs may be useful when a lineage chain grows large, however, and we discuss how to do it in
x
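The lineage idea described above can be illustrated with a small single-process sketch. This is not Spark's implementation: `ToyRDD` and all of its methods are hypothetical names, partitions are plain Python lists, and a "lost" partition is simulated by dropping its cached copy. Each derived dataset stores only a pointer to its parent and the function to apply, so a dropped partition can be rebuilt from its parent's data.

```python
class ToyRDD:
    """Hypothetical, single-process stand-in for a partitioned dataset."""

    def __init__(self, num_partitions, compute, parent=None, op="source"):
        self.num_partitions = num_partitions
        self._compute = compute   # function: partition index -> list of items
        self.parent = parent      # lineage pointer to the dataset we derive from
        self.op = op              # name of the transformation, for display
        self._cache = {}          # partition index -> materialized items

    @classmethod
    def from_partitions(cls, partitions):
        # The source data stands in for stable storage and is never lost here.
        parts = [list(p) for p in partitions]
        return cls(len(parts), lambda i: list(parts[i]))

    def map(self, f):
        # Coarse-grained: the same function is applied to every item.
        return ToyRDD(self.num_partitions,
                      lambda i: [f(x) for x in self.partition(i)],
                      parent=self, op="map")

    def filter(self, pred):
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if pred(x)],
                      parent=self, op="filter")

    def partition(self, i):
        """Return partition i, computing it from the parent if it is missing."""
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a failure by discarding the materialized partition."""
        self._cache.pop(i, None)

    def lineage(self):
        """List the transformations that produced this dataset, oldest first."""
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


source = ToyRDD.from_partitions([[1, 2, 3], [4, 5, 6]])
even_squares = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = even_squares.partition(1)   # materializes partition 1 only
even_squares.lose_partition(1)       # simulated failure
after = even_squares.partition(1)    # rebuilt by re-running the filter step

print(even_squares.lineage(), before, after)
```

Note that only the requested partition is ever computed, and that the toy keeps every materialized partition in memory, so the rebuild here reuses the parent's copy of partition 1; a real system would have to decide which datasets to keep and which to recompute.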