Here is your PDF: Object-level document analysis of PDF files

The number of pages within the document is: 9

The self-declared author(s) is/are:
Tamir Hassan

The subject is as follows:
Original authors did not specify.


The access date was:
2019-02-10 16:33:55.126585



The content is as follows:
Object-Level Document Analysis of PDF Files

Tamir Hassan
Database and Artificial Intelligence Group
Information Systems Institute
Technische Universität Wien
Favoritenstraße 9-11, A-1040 Wien, Austria
hassan@dbai.tuwien.ac.at

ABSTRACT

The PDF format is commonly used for the exchange of documents on the Web and there is a growing need to understand and extract or repurpose data held in PDF documents. Many systems for processing PDF files use algorithms designed for scanned documents, which analyse a page based on its bitmap representation. We believe this approach to be inefficient. Not only does the rasterization step cost processing time, but information is also lost and errors can be introduced.

Inspired primarily by the need to facilitate machine extraction of data from PDF documents, we have developed methods to extract textual and graphic content directly from the PDF content stream and represent it as a list of "objects" at a level of granularity suitable for structural understanding of the document. These objects are then grouped into lines, paragraphs and higher-level logical structures using a novel bottom-up segmentation algorithm based on visual perception principles. Experimental results demonstrate the viability of our approach, which is currently used as a basis for HTML conversion and data extraction methods.
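To make the idea of bottom-up grouping concrete, the following is a minimal sketch of how text fragments with bounding boxes might be merged into lines. It is not the segmentation algorithm described in the paper, which is based on visual perception principles and is not specified in this excerpt; it swaps in a simple vertical-overlap heuristic. The names `TextObject`, `vertical_overlap`, `group_into_lines` and `line_text`, as well as the `min_overlap` threshold, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextObject:
    """Hypothetical text fragment with a bounding box (PDF-style: y grows upward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def vertical_overlap(a: TextObject, b: TextObject) -> float:
    """Return the shared vertical extent as a fraction of the shorter fragment's height."""
    shared = min(a.y1, b.y1) - max(a.y0, b.y0)
    shorter = min(a.y1 - a.y0, b.y1 - b.y0)
    return max(shared, 0.0) / shorter if shorter > 0 else 0.0


def group_into_lines(objects: List[TextObject],
                     min_overlap: float = 0.5) -> List[List[TextObject]]:
    """Greedy grouping (an assumed heuristic): a fragment joins the first
    existing line whose most recently added fragment it overlaps vertically."""
    lines: List[List[TextObject]] = []
    # Visit fragments from the top of the page downward, then left to right.
    for obj in sorted(objects, key=lambda o: (-o.y1, o.x0)):
        for line in lines:
            if vertical_overlap(line[-1], obj) >= min_overlap:
                line.append(obj)
                break
        else:
            lines.append([obj])
    # Restore reading order within each line.
    for line in lines:
        line.sort(key=lambda o: o.x0)
    return lines


def line_text(line: List[TextObject]) -> str:
    """Join the fragments of one line with single spaces."""
    return " ".join(o.text for o in line)
```

Calling `group_into_lines` on a list of fragments returns one list per detected line, ordered from the top of the page downward. A fuller system would repeat the same kind of merging at the next level, combining lines into paragraphs and columns.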
Categories and Subject Descriptors

I.7.5 [Document and Text Processing]: Document Capture - document analysis; H.3.3 [Information Systems]: Information Search and Retrieval

General Terms

Algorithms, Experimentation

1. INTRODUCTION

In recent years, PDF has become the de facto standard for exchanging print-oriented documents on the Web. Its popularity can be attributed to its roots as a page-description language. Any document can be converted to PDF as easily as sending it to the printer, with the confidence that the formatting and layout will be preserved when it is viewed or printed across different computing platforms. However, the print-oriented nature of PDF also provides a significant drawback: PDFs contain very little structural information about the content held within them, and extracting or repurposing this content is therefore a difficult task.

In the last few decades, there has been much work in the field of document understanding which aims to detect logical structure in unstructured representations of documents; usually scanned images. Many of these approaches have also been applied to PDF. Many of these methods simply make use of a bitmap rendition of each page of the PDF file at a given resolution and apply methods similar to those designed for scanned pages. Relatively little information, typically just the text, is used from the original PDF source. Other approaches do examine the PDF source code but make somewhat limited use of the data. Section 2 describes these approaches in more detail.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DocEng'09, September 16-18, 2009, Munich, Germany.
Copyright 2009 ACM 978-1-60558-575-8/09/09 ...$10.00.
Figure 1: Document representation hierarchy

Fig. 1 gives an overview of the document authoring process and the various levels of abstraction in which a document is represented during the document authoring process; from semantic concepts before any words have been written at the start to the printed image of the page at the end. Document understanding is essentially the opposite of document authoring. We believe the PDF representation to be a logical step above the printed page¹, and therefore that

¹ Please note that we are referring to PDF files which have been generated digitally, usually directly from a DTP or word-processing application, regardless of whether they are tagged or not. More information is given in Section 6.
