Text Extraction: detection of whitespace

May 28, 2025 APDFL: Text

APDFL's text extraction API, also known as the WordFinder, extracts Words, which do not always match the everyday definition of a word. It is easy to assume that these Words will be separated by whitespace and end-of-line characters, but that is not always the case.

Text strings in PDF content can be placed arbitrarily on the page and are not necessarily separated by any whitespace characters. In fact, a single word may be broken up into several text strings for kerning purposes. Words may also consist of numeric and/or punctuation sequences that are not whitespace-separated.
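To make the positioning point concrete, here is a minimal Java sketch of how a space can be inferred from the horizontal gap between text strings rather than from any character in the content. It illustrates the general idea only and is not how the WordFinder is implemented; the `Run` class, the `join` method and the threshold value are all hypothetical names and numbers chosen for this example.

```java
import java.util.List;

class GapJoiner {
    // Hypothetical model of one text string on a line: its characters and
    // its horizontal extent on the page, in points.
    static final class Run {
        final String text;
        final double xStart;
        final double xEnd;

        Run(String text, double xStart, double xEnd) {
            this.text = text;
            this.xStart = xStart;
            this.xEnd = xEnd;
        }
    }

    // Concatenates runs in order, inserting a synthetic space only where the
    // gap between two neighbouring runs is wider than spaceThreshold.
    static String join(List<Run> runs, double spaceThreshold) {
        StringBuilder out = new StringBuilder();
        Run prev = null;
        for (Run r : runs) {
            if (prev != null && r.xStart - prev.xEnd > spaceThreshold) {
                out.append(' ');
            }
            out.append(r.text);
            prev = r;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "W", "ord" and "s" are one word split for kerning; "next" sits
        // 3 points further right, with no space character in the content.
        List<Run> line = List.of(
            new Run("W", 0.0, 8.0),
            new Run("ord", 7.5, 25.0),
            new Run("s", 25.0, 30.0),
            new Run("next", 33.0, 50.0));
        System.out.println(join(line, 2.0)); // prints "Words next"
    }
}
```

In practice a fixed threshold is too crude, since a sensible gap width depends on font size and character spacing, which is one reason to rely on the attributes reported by the extraction API instead of re-deriving them.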

Consequently, it is important to pay attention to the attributes associated with the detected Words in order to extract not just the character sequences found in the content, but also the synthetic characters (such as spaces and line breaks) that are not technically present in the content but are nonetheless implied by positioning.
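As a rough illustration of the point above, the following Java sketch rebuilds readable text from a list of words by consulting per-word flags. The `Word` class, the `Flag` enum and its two values are hypothetical stand-ins invented for this example. They are not the APDFL API, whose actual usage is shown in the sample files named in this article.

```java
import java.util.EnumSet;
import java.util.List;

class WordAssembler {
    // Hypothetical attribute flags for this sketch only.
    enum Flag { ADJACENT_TO_SPACE, LAST_WORD_ON_LINE }

    // Hypothetical word: the characters found in the content plus its flags.
    static final class Word {
        final String text;
        final EnumSet<Flag> flags;

        Word(String text, EnumSet<Flag> flags) {
            this.text = text;
            this.flags = flags;
        }
    }

    // Appends each word's text, then the synthetic character its flags imply:
    // a newline at the end of a line, a space where one is implied by
    // position, and nothing when the next word follows immediately.
    static String assemble(List<Word> words) {
        StringBuilder out = new StringBuilder();
        for (Word w : words) {
            out.append(w.text);
            if (w.flags.contains(Flag.LAST_WORD_ON_LINE)) {
                out.append('\n');
            } else if (w.flags.contains(Flag.ADJACENT_TO_SPACE)) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Word> words = List.of(
            new Word("Total", EnumSet.noneOf(Flag.class)),
            new Word(":", EnumSet.of(Flag.ADJACENT_TO_SPACE)),
            new Word("42", EnumSet.noneOf(Flag.class)),
            new Word("%", EnumSet.of(Flag.LAST_WORD_ON_LINE)),
            new Word("done", EnumSet.of(Flag.LAST_WORD_ON_LINE)));
        System.out.print(assemble(words)); // prints "Total: 42%" then "done"
    }
}
```

Joining every word with a single space would instead give "Total : 42 % done", which shows why the flags, and not a blanket separator, should decide what goes between words.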

Using the C/C++ API, the relevant code is in: ExtractText.cpp

Using the .NET [Framework] API, the relevant code is in: TextExtract.cs

Using the Java API, the relevant code is in: TextExtract.java