PDF Data Extraction

Posted: 7th August 2010 by davy in Development, Java

One of my clients was looking to improve a very large database of PDF files they have.  I can't talk too much about it, but what I can describe is the kind of development work I did. PDF files are a pretty neat idea for companies with documents that need to be controlled.  PDF is more of a container: a mechanism that allows the author of the content to keep it under control when it's issued.  You can go as far as making sure the consumer cannot change any of the content.

But my client had this large collection of PDF files that were in a huge variety of formats.  The main requirement was to try and leverage this as a resource.  So I wrote some C++, C# and Java tools that made up a PDF processing factory.  I used the superb Xpdf as a platform to extract the data.
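To give a feel for the extraction step, here is a minimal Java sketch that shells out to Xpdf's pdftotext command-line tool and captures the text from stdout. It is not the client's actual tooling. The class and method names (XpdfTextExtractor, buildCommand, extractText) are made up for illustration, and it assumes pdftotext is installed and on the PATH.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Xpdf: wraps the pdftotext command-line tool.
class XpdfTextExtractor {

    // Builds the pdftotext command line. "-layout" keeps the physical page layout,
    // "-enc UTF-8" sets the output encoding, and "-" as the output file means stdout.
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-layout", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext and returns the extracted text.
    // Assumption: the pdftotext binary is available on the PATH.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Read all of stdout before waiting, so a full pipe buffer cannot block the child.
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode + " for " + pdf);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java XpdfTextExtractor some-file.pdf
        System.out.println(extractText(Path.of(args[0])));
    }
}
```

Running one process per file is slow but simple. For a large collection you would probably batch the files and run several extractions in parallel.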

I then built a WordPress site to host the PDF files in a new format: HTML.

This allowed the PDF files to be viewed in their original format, but they could also easily be accessed as an integrated Web page.  I also extracted SEO data and added that to the WP search engine.  I also did another version of this site using MediaWiki, the platform on which Wikipedia is based.  I changed the search engine to Lucene and was able to load up millions of PDF files converted to HTML for fast searching and tagging.

I also added a “headline” extract tool which looked at the PDF file text and worked out what was useful information.  That became the excerpt for a WP Post.  You could then browse millions of PDF files using tags and categories and read about them, essentially taking the PDF data out of its proprietary container and making it participate.
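The real headline tool was more involved than I can show, but a rough Java sketch of the idea looks like this. Everything in it is an assumption made for illustration: the ExcerptPicker class, the six-word minimum and the 80% letter ratio are hypothetical choices, not the rules the actual tool used. It skips lines that look like page furniture (page numbers, document codes) and joins the first prose-like lines into an excerpt.

```java
// Hypothetical heuristic excerpt picker; the thresholds are illustrative guesses.
class ExcerptPicker {

    // A line counts as prose if it has at least six words and at least 80%
    // of its non-space characters are letters.
    static boolean looksLikeProse(String line) {
        String trimmed = line.trim();
        if (trimmed.split("\\s+").length < 6) {
            return false;
        }
        long letters = trimmed.chars().filter(Character::isLetter).count();
        long nonSpace = trimmed.chars().filter(c -> !Character.isWhitespace(c)).count();
        return letters * 5 >= nonSpace * 4;
    }

    // Joins the first prose-like lines until the excerpt reaches maxChars,
    // then cuts it to maxChars and marks the cut with "...".
    static String pickExcerpt(String text, int maxChars) {
        StringBuilder excerpt = new StringBuilder();
        for (String line : text.split("\\R")) {
            if (!looksLikeProse(line)) {
                continue;
            }
            if (excerpt.length() > 0) {
                excerpt.append(' ');
            }
            excerpt.append(line.trim());
            if (excerpt.length() >= maxChars) {
                break;
            }
        }
        if (excerpt.length() > maxChars) {
            return excerpt.substring(0, maxChars).trim() + "...";
        }
        return excerpt.toString();
    }
}
```

Calling ExcerptPicker.pickExcerpt(extractedText, 200) on the text of a PDF returns a short string that could be used as a post excerpt. Simple rules like these break down quickly on files from different producers.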

This would have been an easy challenge had the PDF files been produced by the same authoring program and PDF producer application.  They weren't!  So I had to put in almost-AI techniques to identify raw PDF data as something useful.  I also used a technique that was detailed in this academic paper here.

