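Of the three tools I considered (Apache PDFBox, XPDF and iText), my first choice was Apache PDFBox. iText, as I understood it at the time, meant working on the .NET platform (strictly speaking, iText is an independent Java library rather than a port of PDFBox, and its .NET port is iTextSharp). As I did not have access to a .NET development environment, that option was ruled out for me.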

I looked at PDFBox because I had used it in the past to extract pages from a PDF, although only in command-line mode. Since PDF is effectively a binary format, I could not simply drop the file into Notepad++ and look at its structure, so I converted the PDF into a text file instead. When I looked at the output, I was in for a surprise: the structure of the generated text file bore no resemblance to that of the PDF. Text that appears side by side in the PDF ends up separated by multiple lines in the generated text file.

It was then that I learned that PDF files do not necessarily record the logical structure or reading order of their content. PDF is a display format: it uses coordinates to place elements on the screen as well as on the printed page. Hence, when the text is extracted, the order of elements may not be preserved. Page boundaries do survive, though: information that appears on page 1 of a PDF still appears under page 1 in the extracted text. Below are an image of the sample layout created in PowerPoint and the equivalent PDF output generated by PDFCreator.


Layout created in PowerPoint


PDF generated by PDFCreator
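
To make the coordinate idea concrete, here is a minimal sketch in plain Java. It does not use PDFBox at all: the `Fragment` class, the sample text and the coordinates are all made up for illustration. It only shows how text stored in one order can read very differently once it is sorted by position on the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ReadingOrderDemo {

    // Hypothetical stand-in for one positioned piece of text on a page.
    // x grows to the right and y grows upward, so a larger y is higher on the page.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }

        String text() { return text; }
        double x() { return x; }
        double y() { return y; }
    }

    // Made-up page: the labels were written first, then the values,
    // even though each value sits next to its label on the page.
    static List<Fragment> samplePage() {
        return Arrays.asList(
                new Fragment("Name:", 50, 700),
                new Fragment("City:", 50, 680),
                new Fragment("Alice", 200, 700),
                new Fragment("Pune", 200, 680));
    }

    // Joins the fragments in the order they were stored, ignoring position.
    static String inStoredOrder(List<Fragment> fragments) {
        return fragments.stream()
                .map(Fragment::text)
                .collect(Collectors.joining(" | "));
    }

    // Sorts top to bottom (descending y), then left to right (ascending x),
    // which approximates how a person would read the page.
    static String inVisualOrder(List<Fragment> fragments) {
        Comparator<Fragment> byY = Comparator.comparingDouble(Fragment::y);
        Comparator<Fragment> visual = byY.reversed().thenComparingDouble(Fragment::x);

        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(visual);
        return sorted.stream()
                .map(Fragment::text)
                .collect(Collectors.joining(" | "));
    }

    public static void main(String[] args) {
        List<Fragment> page = samplePage();
        System.out.println("Stored order: " + inStoredOrder(page));
        System.out.println("Visual order: " + inVisualOrder(page));
    }
}
```

In this sketch the stored order keeps the two labels together and the two values together, while the visual order pairs each label with its value. A real extractor would also need some tolerance when deciding whether two fragments share a line, since their baselines rarely match exactly. PDFBox's `PDFTextStripper` offers a `setSortByPosition(true)` setting that, as far as I understand, is aimed at the same problem.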

I will post the generated text files in the following posts.