Fixing Extra Spaces in Text Extraction from PDF Documents

The Details

When using the Extract Text API, users may find that the output contains unwanted spaces after certain special characters. The extracted text then no longer matches the text in the PDF document, which can break applications that depend on clean extraction for further processing, such as data redaction.

The problem is most visible in documents that contain many special characters: punctuation marks, symbols, and other non-alphanumeric characters that occur frequently in natural-language text. Users have reported that the extracted output includes spaces that were not present in the original PDF.

This problem highlights a broader challenge in text extraction from PDFs: the inherent complexity of the PDF format and the difficulty in accurately discerning which characters should be treated as word boundaries.

Input File Behavior

The input files that trigger this issue typically contain structured text that mixes regular words with special characters. The way text is encoded in a PDF can vary significantly from one document to another. For example, a PDF document may contain phone numbers formatted in several ways, such as:

456-7890
456.7890
456 7890

During extraction, unwanted spaces may be inserted after the special characters in values like these (for example, after the hyphen or the period), so the output does not mirror the original visual formatting. Furthermore, the raw extracted text may not contain inherent line breaks or spaces, which makes it harder to preserve the intended structure and readability of the information.
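The effect described above can be partly corrected after extraction. The following Python sketch is not part of APDFL; it assumes the unwanted space always appears directly after a hyphen or period that sits between two digits (as in "456- 7890"), and the function name collapse_separator_spaces is hypothetical.

```python
import re

# Assumption: the stray whitespace directly follows a hyphen or period
# that has a digit on its left and a digit after the whitespace.
_SPACE_AFTER_SEPARATOR = re.compile(r"(?<=\d)([-.])[ \t]+(?=\d)")


def collapse_separator_spaces(text: str) -> str:
    """Remove spaces or tabs that follow '-' or '.' between two digits."""
    # Keep the separator (group 1) and drop the whitespace after it.
    return _SPACE_AFTER_SEPARATOR.sub(r"\1", text)


if __name__ == "__main__":
    for sample in ("456- 7890", "456. 7890", "456 7890"):
        print(repr(collapse_separator_spaces(sample)))
```

A value such as "456 7890" is left untouched because it contains no separator character. Note that this rule would also join a sentence ending in a digit to a following sentence that starts with a digit ("... 1234. 2 items"), so the pattern should be narrowed to suit the document set.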

Summary

The text extraction implementation may need to be adjusted to handle word boundaries as they actually occur in your documents. A good starting point is the APDFL code samples that demonstrate text extraction (TextExtract, ListWords) and the handling of line breaks and special characters. From there, adjust the implementation code to account for how words are defined in your document set, and to recognize adjacent characters that should not have spaces between them. Each Word returned by the WordFinder can be checked for attributes such as AdjacentToSpace, HasSoftHyphen, IsLastWordInRegion and LastWordOnLine.
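As an illustration of the approach, the sketch below joins extracted words using per-word flags instead of inserting a space after every word. It is written in Python with a hypothetical ExtractedWord record and does not call APDFL; the fields adjacent_to_space and last_word_on_line are stand-ins for the attributes named above, and the sketch assumes that an adjacent-space flag means the word is followed by a space in the original content. Check the APDFL documentation for the exact meaning of each attribute before relying on it.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ExtractedWord:
    """Hypothetical stand-in for a word returned by a word finder."""
    text: str
    adjacent_to_space: bool = False   # assumed: a space follows this word
    last_word_on_line: bool = False   # assumed: a line break follows this word


def join_words(words: Iterable[ExtractedWord]) -> str:
    """Rebuild text, adding a separator only where a flag calls for one."""
    parts = []
    for word in words:
        parts.append(word.text)
        if word.last_word_on_line:
            parts.append("\n")
        elif word.adjacent_to_space:
            parts.append(" ")
        # Otherwise the next word is appended with no space in between.
    return "".join(parts).rstrip()


if __name__ == "__main__":
    words = [
        ExtractedWord("456-"),
        ExtractedWord("7890", adjacent_to_space=True),
        ExtractedWord("ext", last_word_on_line=True),
        ExtractedWord("12"),
    ]
    print(repr(join_words(words)))  # '456-7890 ext\n12'
```

Because "456-" carries neither flag, "7890" is appended directly to it, which keeps the phone number intact.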