PDF (Portable Document Format), developed by Adobe, has become widely used over the years. Earlier, we needed special converters like Adobe PDF Creator to create PDF files from native documents such as Word or PowerPoint files.
Over time, things became simpler thanks to the availability of ‘PDF printers’ (printer drivers that generate PDF output) such as PDFCreator. PDF is now so ubiquitous that applications like Word and PowerPoint provide native converters.
Recently, I took on the task of extracting specific information from a PDF. You may ask why the information could not be extracted directly from the source. The simple reason is that the source was not available.
For this purpose, I looked at a few options, including Apache PDFBox, XPDF and iText. You may notice that all of these tools are open source, which is why I chose to look at them.
While using these tools, I made a few interesting observations, which I will share in the posts that follow.