Generating a text file from the PDF gave interesting results: the generated text did not keep the format it has in the PDF. PDFBox offers a standard text extraction, whereas XPDF takes parameters and can generate text in various layouts. Running XPDF in ‘standard’ mode, that is, with no layout arguments, produces text arranged almost the same as the text generated by PDFBox. I am including a few examples of the generated text.
PDF text in standard format as generated by XPDF. PDFBox generates text in a similar arrangement (minor differences do exist).
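To make the layout parameters concrete, here is a minimal sketch of driving XPDF's `pdftotext` command-line tool from Python. It assumes `pdftotext` is installed and on the PATH, and it uses the `-layout` and `-raw` flags alongside the no-argument standard mode. The file names and the helper functions `build_command` and `extract_text` are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess

# Layout modes mapped to pdftotext flags; "standard" passes no flag at all.
LAYOUT_FLAGS = {
    "standard": [],
    "layout": ["-layout"],  # try to keep the physical page layout
    "raw": ["-raw"],        # keep text in content-stream order
}


def build_command(pdf_path, txt_path, mode="standard"):
    """Return the pdftotext argument list for the given layout mode."""
    if mode not in LAYOUT_FLAGS:
        raise ValueError(f"unknown mode: {mode}")
    return ["pdftotext", *LAYOUT_FLAGS[mode], pdf_path, txt_path]


def extract_text(pdf_path, txt_path, mode="standard"):
    """Run pdftotext; raises if the tool is missing or exits with an error."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    subprocess.run(build_command(pdf_path, txt_path, mode), check=True)


if __name__ == "__main__":
    # Print the command for each mode instead of running it,
    # so the sketch works even without a sample PDF at hand.
    for mode in LAYOUT_FLAGS:
        print(" ".join(build_command("sample.pdf", f"sample-{mode}.txt", mode)))
```

Calling `extract_text("sample.pdf", "sample-layout.txt", "layout")` would write the layout-preserving variant, which makes it easy to compare the modes side by side on the same document.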
As you can see from the examples, XPDF is a versatile tool for extracting text from a PDF while preserving the layout to some extent. Apache PDFBox, however, is equally feature-rich. The biggest difference between the two is licensing: XPDF uses the copyleft GPL 2.0, while PDFBox uses the permissive Apache License 2.0.
- The PDF Saga – part 1, https://twentymegahertz.wordpress.com/2016/08/21/the-pdf-saga-part-1/
- The PDF Saga – part 2, https://twentymegahertz.wordpress.com/2016/09/06/the-pdf-saga-part-2/