Can the text be extracted from a PDF and stored in XML format?

The Library can extract text from a PDF document in two ways. You can enumerate the content of a page, which gives you all of the content, text and graphics alike, in the order in which it was specified. Or you can use the WordFinder utility to return a list of words, either in the order they were encountered or in (presumed) reading order. The list of words includes appearance and style information.

However, the words can be placed into the document in any order when it is assembled, so it may be difficult to interpret this content back into text. There is no requirement for an application to lay down the text of a page from top to bottom, from one side to the other, or even in the order in which you would read it. Although WordFinder does a fairly good job of separating words, it is still sometimes wrong.
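To illustrate why reading order can only be presumed, here is a minimal sketch of one simple heuristic: group words by baseline, then sort each line from left to right. The `Word` record, its fields, and the `reading_order` function are hypothetical and are not part of the Library's API, and this is not the algorithm WordFinder uses. It only shows the kind of guesswork involved.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word record: text plus position in PDF points."""
    text: str
    x: float  # left edge
    y: float  # baseline; the PDF origin is bottom-left, so larger y is higher


def reading_order(words, line_tolerance=2.0):
    """Group words into lines by baseline, then sort each line left to right."""
    lines = []  # each entry is a list of words sharing roughly one baseline
    for word in sorted(words, key=lambda w: -w.y):  # top of page first
        if lines and abs(lines[-1][0].y - word.y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [sorted(line, key=lambda w: w.x) for line in lines]


# Words in the arbitrary order a PDF producer might have written them.
content_order = [
    Word("world", 60, 700),
    Word("Second", 10, 686),
    Word("Hello", 10, 700.5),
    Word("line", 60, 686),
]

lines = reading_order(content_order)
text = "\n".join(" ".join(w.text for w in line) for line in lines)
print(text)
```

A heuristic like this breaks down on multi-column layouts, rotated text, and tables, which is one reason no extraction approach is right for every document.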

When enumerating the content of a page, your application needs to duplicate the functionality of the WordFinder itself, so you cannot expect perfectly accurate exports for all documents.

Your application can store the resulting text streams in a variety of ways, but it will be difficult to identify constructs like sentences and paragraphs. What the eye recognizes as a logical arrangement of text is not always how a parsing or WordFinder application will present it. Writing the extracted text as XML (or any other format) would need to be handled by your application.
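As a rough sketch of what that application-side step might look like, the following example writes a list of extracted words to XML using Python's standard `xml.etree.ElementTree` module. The input structure (a list of pages, each a list of dictionaries with `text`, `font`, and `size` keys) and the element names are assumptions made for illustration. They do not reflect the Library's actual output or any required schema.

```python
import xml.etree.ElementTree as ET


def words_to_xml(pages):
    """Serialize extracted words to an XML string.

    `pages` is a hypothetical structure: one list of word dictionaries
    per page, each with "text", "font", and "size" keys.
    """
    root = ET.Element("document")
    for page_number, words in enumerate(pages, start=1):
        page_el = ET.SubElement(root, "page", number=str(page_number))
        for word in words:
            word_el = ET.SubElement(
                page_el, "word", font=word["font"], size=str(word["size"])
            )
            # ElementTree escapes characters such as & and < on output.
            word_el.text = word["text"]
    return ET.tostring(root, encoding="unicode")


# Sample data standing in for words returned by an extraction step.
pages = [
    [
        {"text": "Hello", "font": "Helvetica", "size": 12},
        {"text": "R&D", "font": "Helvetica-Bold", "size": 12},
    ]
]

xml_text = words_to_xml(pages)
print(xml_text)
```

This sketch stores one element per word because words are what the extraction step reliably provides. Grouping them into sentence or paragraph elements would require additional logic in your application.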
