
When I generated a text file from the PDF, the results were interesting: the extracted text was not laid out the way it appears in the PDF. PDFBox offers a standard text extraction, while XPDF accepts parameters and can produce text in several layouts. Running XPDF with no arguments (the 'standard' mode) produces text arranged almost the same way as the PDFBox output. A few examples of the generated text follow.

pdf-standard

PDF in standard format as generated by XPDF. PDFBox generates text in a similar arrangement, with minor differences.

pdf-lineprinter

PDF in line printer format as generated by XPDF

pdf-raw

PDF in raw format as generated by XPDF

pdf-layout

PDF in layout preserving format as generated by XPDF

pdf-table

PDF in tabular format as generated by XPDF
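For anyone who wants to reproduce the layouts above, here is a minimal sketch of how the XPDF runs could be scripted from Java. It assumes the Xpdf 4.x `pdftotext` binary is on the PATH and that its `-layout`, `-table`, `-lineprinter` and `-raw` flags select the corresponding modes (other `pdftotext` builds, such as the Poppler one, do not accept all of these). The class name, the `Mode` enum and the file names are hypothetical and are not taken from the original experiment.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PdfToTextModes {

    /** The layouts shown above, each mapped to the pdftotext flag assumed to select it. */
    enum Mode {
        STANDARD(null),              // no flag: default reading-order output
        LAYOUT("-layout"),           // keep the physical page layout
        TABLE("-table"),             // tuned for tabular data
        LINE_PRINTER("-lineprinter"),// fixed-pitch, line-printer style
        RAW("-raw");                 // content-stream order

        final String flag;

        Mode(String flag) {
            this.flag = flag;
        }
    }

    /** Builds the argument list: pdftotext [flag] input.pdf output.txt */
    static List<String> buildCommand(Mode mode, String pdfPath, String txtPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftotext");
        if (mode.flag != null) {
            cmd.add(mode.flag);
        }
        cmd.add(pdfPath);
        cmd.add(txtPath);
        return cmd;
    }

    /** Runs pdftotext for one mode and returns its exit code. */
    static int extract(Mode mode, String pdfPath, String txtPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(mode, pdfPath, txtPath))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // With no argument, only print the commands for a placeholder file name.
        boolean dryRun = args.length == 0;
        String pdf = dryRun ? "sample.pdf" : args[0];

        for (Mode mode : Mode.values()) {
            String out = "out-" + mode.name().toLowerCase() + ".txt";
            if (dryRun) {
                System.out.println(String.join(" ", buildCommand(mode, pdf, out)));
            } else {
                System.out.println(mode + " exit code: " + extract(mode, pdf, out));
            }
        }
    }
}
```

Running it with a PDF path as the only argument writes one text file per mode, which makes it easy to compare the outputs side by side. Keeping the command construction in `buildCommand` separate from process execution means the flag mapping can be checked without XPDF installed.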

As you can see from the examples, XPDF is a versatile tool for extracting text from a PDF while preserving the layout to some extent. Apache PDFBox, however, is equally feature-rich. The biggest difference between the two is licensing: XPDF uses GPL 2.0, while PDFBox uses Apache License 2.0.
