Can the text be extracted from a PDF and stored in XML format?

The Adobe PDF Library can extract text from a PDF document in two ways. You can enumerate the Content of a page, which gives you all text, graphics and path elements in the order they were placed in the page content stream. Alternatively, you can use the WordFinder utility to return a list of Words, either in the order they appear in the content stream or in (presumed) reading order. The list of words includes appearance style information.

Note that the text of a PDF document can be placed into the Content stream of the page in any order when it is being assembled. There is no requirement for the creation application to lay down the text of a page from top to bottom, from left to right, or in the order that you would read it. As a result, it may be difficult to interpret this content back into text the way that someone reading the document on screen might expect. Although the WordFinder does a good job of separating words, it is still sometimes wrong.
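To illustrate why reading order can only be presumed, here is a minimal Python sketch of one simple heuristic: group words by baseline, then sort each line from left to right. It does not use the Adobe PDF Library API. The `Word` class, its fields, and the `line_tolerance` parameter are hypothetical stand-ins for whatever position data your extraction step returns.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical record for one extracted word; not a library type.
    text: str
    x: float  # left edge of the word, in page units
    y: float  # baseline, in PDF coordinates (y grows upward)


def naive_reading_order(words, line_tolerance=2.0):
    """Group words into lines by baseline, then sort each line left to right."""
    lines = []  # each entry is [baseline_y, [words on that line]]
    # Visit words from the top of the page downward.
    for w in sorted(words, key=lambda w: -w.y):
        if lines and abs(lines[-1][0] - w.y) <= line_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append([w.y, [w]])
    return [[w.text for w in sorted(ws, key=lambda w: w.x)] for _, ws in lines]


# Words listed in the (arbitrary) order a creator might have written them.
stream_order = [
    Word("world", 60, 700),
    Word("Second", 10, 686),
    Word("Hello", 10, 700),
    Word("line", 55, 686),
]
ordered = naive_reading_order(stream_order)
```

A heuristic this simple works for a single column of horizontal text, but it would interleave the lines of a two-column page and mishandle rotated text, which is the kind of case where any automatic ordering may be wrong.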

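If you enumerate the Text content of a page yourself instead of using the WordFinder, your application needs to duplicate the functionality of the WordFinder: separating words and inferring their reading order. Because that work is heuristic, you cannot expect perfectly accurate exports for all documents.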

Your application can store the resulting text streams in a variety of ways, but it will be difficult to identify constructs like sentences and paragraphs. What the eye recognizes as a logical arrangement of text is not always how a parsing or WordFinder application will present it. Writing the extracted text as XML (or any other format) needs to be handled by your application.
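As a rough sketch of the application-side step, the following Python code writes already-extracted words to XML using the standard library. It assumes the words have been collected into plain tuples of text, font name and font size; that tuple layout and the `document`/`page`/`word` element names are illustrative choices, not something defined by the Adobe PDF Library.

```python
import xml.etree.ElementTree as ET


def words_to_xml(pages):
    """Serialize extracted words to an XML string.

    pages: list of pages, each a list of (text, font_name, font_size)
    tuples. This input shape is an assumption made for the example.
    """
    root = ET.Element("document")
    for page_number, words in enumerate(pages, start=1):
        page_el = ET.SubElement(root, "page", number=str(page_number))
        for text, font_name, font_size in words:
            word_el = ET.SubElement(
                page_el, "word", font=font_name, size=f"{font_size:g}"
            )
            # ElementTree escapes &, < and > when the tree is serialized.
            word_el.text = text
    return ET.tostring(root, encoding="unicode")


# Sample data containing characters that must be escaped in XML.
sample = [[("Q&A", "Helvetica", 12.0), ("<draft>", "Helvetica-Bold", 12.0)]]
xml_text = words_to_xml(sample)
```

Building the tree through an XML library rather than concatenating strings takes care of escaping. Extracted text can still contain control characters that XML 1.0 does not permit, so your application may need to filter those before serializing.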