The number of pages within the document is: 9
The self-declared author(s) is/are:
Tamir Hassan
The subject is as follows:
Original authors did not specify.
The original URL is: LINK
The access date was:
2019-02-10 16:33:55.126585
The content is as follows:
Object-Level Document Analysis of PDF Files

Tamir Hassan
Database and Artificial Intelligence Group
Information Systems Institute
Technische Universität Wien
Favoritenstraße 9-11, A-1040 Wien, Austria
hassan@dbai.tuwien.ac.at

ABSTRACT
The PDF format is commonly used for the exchange of documents on the Web and there is a growing need to understand and extract or repurpose data held in PDF documents. Many systems for processing PDF files use algorithms designed for scanned documents, which analyse a page based on its bitmap representation. We believe this approach to be inefficient. Not only does the rasterization step cost processing time, but information is also lost and errors can be introduced.

Inspired primarily by the need to facilitate machine extraction of data from PDF documents, we have developed methods to extract textual and graphic content directly from the PDF content stream and represent it as a list of "objects" at a level of granularity suitable for structural understanding of the document. These objects are then grouped into lines, paragraphs and higher-level logical structures using a novel bottom-up segmentation algorithm based on visual perception principles. Experimental results demonstrate the viability of our approach, which is currently used as a basis for HTML conversion and data extraction methods.
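To make the bottom-up idea concrete, the following is a minimal sketch of grouping text objects into lines by geometric proximity. It is an illustration only and is not the segmentation algorithm developed in this paper: the `TextFragment` type, the `group_into_lines` function and the `max_gap` threshold are all hypothetical names and values chosen for the example, and the bounding boxes are assumed to have already been obtained from the PDF content stream.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextFragment:
    """Hypothetical text object: a string plus its bounding box in page units."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge (PDF coordinates: y grows upward)
    x1: float  # right edge
    y1: float  # top edge


def group_into_lines(fragments: List[TextFragment],
                     max_gap: float = 5.0) -> List[List[TextFragment]]:
    """Greedily merge fragments that overlap vertically and are close horizontally.

    max_gap is an assumed threshold; a real system would derive it from
    font size or observed spacing rather than use a fixed constant.
    """
    lines: List[List[TextFragment]] = []
    # Visit fragments left to right so each line is extended at its right end.
    for frag in sorted(fragments, key=lambda f: f.x0):
        for line in lines:
            last = line[-1]
            # Positive value means the two boxes share some vertical extent.
            overlap = min(last.y1, frag.y1) - max(last.y0, frag.y0)
            gap = frag.x0 - last.x1
            if overlap > 0 and gap <= max_gap:
                line.append(frag)
                break
        else:
            # No existing line accepted the fragment: start a new one.
            lines.append([frag])
    # Return lines in reading order: top to bottom, then left to right.
    lines.sort(key=lambda line: (-line[0].y1, line[0].x0))
    return lines
```

Fragments separated by more than `max_gap` stay in separate lines even at the same height, which is what keeps neighbouring columns apart in this simplified setting. Grouping lines into paragraphs could follow the same pattern using vertical distances.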
Categories and Subject Descriptors
I.7.5 [Document and Text Processing]: Document Capture - document analysis; H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Algorithms, Experimentation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DocEng'09, September 16–18, 2009, Munich, Germany.
Copyright 2009 ACM 978-1-60558-575-8/09/09 …$10.00.

1. INTRODUCTION
In recent years, PDF has become the de facto standard for exchanging print-oriented documents on the Web. Its popularity can be attributed to its roots as a page-description language. Any document can be converted to PDF as easily as sending it to the printer, with the confidence that the formatting and layout will be preserved when it is viewed or printed across different computing platforms. However, the print-oriented nature of PDF also provides a significant drawback: PDFs contain very little structural information about the content held within them, and extracting or repurposing this content is therefore a difficult task.

In the last few decades, there has been much work in the field of document understanding which aims to detect logical structure in unstructured representations of documents; usually scanned images. Many of these approaches have also been applied to PDF. Many of these methods simply make use of a bitmap rendition of each page of the PDF file at a given resolution and apply methods similar to those designed for scanned pages. Relatively little information, typically just the text, is used from the original PDF source. Other approaches do examine the PDF source code but make somewhat limited use of the data. Section 2 describes these approaches in more detail.
Figure 1: Document representation hierarchy

Fig. 1 gives an overview of the document authoring process and the various levels of abstraction in which a document is represented during the document authoring process; from semantic concepts before any words have been written at the start to the printed image of the page at the end. Document understanding is essentially the opposite of document authoring. We believe the PDF representation to be a logical step above the printed page¹, and therefore that

¹ Please note that we are referring to PDF files which have been generated digitally, usually directly from a DTP or word-processing application, regardless of whether they are tagged or not. More information is given in Section 6.