Extracting text and data from PDF files with UniPDF

UniPDF contains a powerful package called extractor, designed to facilitate common extraction operations on PDF files. It enables extracting text and images, and also provides functionality for extracting tables from PDF files.

Initializing the extractor package to get page contents

PDF files are organized by pages, and the extractor package is likewise designed around extracting contents on a page-by-page basis.  To use it, we create an extractor for a loaded page model, as follows:

ex, err := extractor.New(page)
if err != nil {
    return err
}
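
For context, the page passed to extractor.New above comes from a loaded PDF reader. A minimal sketch of obtaining it, assuming the UniPDF v3 import path github.com/unidoc/unipdf/v3/model and its NewPdfReaderFromFile helper ("input.pdf" is just a placeholder file name):

import (
    "github.com/unidoc/unipdf/v3/model"
)

// Open the PDF and build a reader ("input.pdf" is a placeholder path).
pdfReader, f, err := model.NewPdfReaderFromFile("input.pdf", nil)
if err != nil {
    return err
}
defer f.Close()

// Load the page model for page 1 (pages are 1-indexed).
page, err := pdfReader.GetPage(1)
if err != nil {
    return err
}
// page can now be passed to extractor.New as shown above.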

The primary methods of the Extractor are:

  • ExtractText(), which returns the textual content of the page as a single string.
  • ExtractPageImages(), which returns the images on the page, including the image data as well as position and size information for each image (see the sketch after this list).
  • ExtractPageText(), which returns a *PageText, a high-level object representing the text on the page, including information about formatting and positions.
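
As a rough sketch of image extraction (the nil options argument and the PageImages/ImageMark fields Images, X, Y, Width and Height are based on UniPDF v3 and should be verified against the version in use), one could enumerate the images on a page like this:

pageImages, err := ex.ExtractPageImages(nil)
if err != nil {
    return err
}
for i, img := range pageImages.Images {
    // Position and size of the image on the page, in PDF coordinates.
    fmt.Printf("Image %d: position (%.1f, %.1f), size %.1f x %.1f\n",
        i+1, img.X, img.Y, img.Width, img.Height)
    // img.Image holds the image data itself for further processing or saving.
}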

Further, the *PageText object has some useful methods:

  • Tables() returns the text tables detected on the page.
  • Marks() returns the low-level TextMarks, i.e. the individual textual marks on the page. Often a single mark corresponds to a single character (glyph), which enables low-level processing.

Extracting a Page as Pure Text

With the page extractor initialized as above, we can get the page contents as follows:

text, err := ex.ExtractText()
if err != nil {
    return err
}
fmt.Printf("Page text: %s\n", text)

This is easy and yet can be quite powerful.  The extraction algorithm has been optimized to work well on multi-column pages and pages with complex layouts, and we continuously work on improving it.
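
Since extraction works page by page, getting the text of a whole document is simply a loop over its pages. A brief sketch, assuming a *model.PdfReader opened as in the earlier snippet (GetNumPages and GetPage are the reader methods used):

numPages, err := pdfReader.GetNumPages()
if err != nil {
    return err
}
for n := 1; n <= numPages; n++ {
    // Load each page and run the extractor on it.
    page, err := pdfReader.GetPage(n)
    if err != nil {
        return err
    }
    ex, err := extractor.New(page)
    if err != nil {
        return err
    }
    text, err := ex.ExtractText()
    if err != nil {
        return err
    }
    fmt.Printf("--- Page %d ---\n%s\n", n, text)
}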

The following playground example showcases text extraction on a real PDF. Try opening it and uploading your own file to test; the extracted text is shown in the output console.

Extracting Text with Position and Formatting Style

Formatting and position information can be used to enable more powerful inference, such as detecting headings, body text and other structural elements of the content.

The examples in our GitHub repository can serve as a good starting point for using UniPDF to work with the positional and style information of the text contents.
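
As an illustrative sketch only (the four return values of ExtractPageText and the TextMark fields Text, BBox, FontSize and Meta reflect the UniPDF v3 API and should be checked against the version in use), iterating over the text marks of a page to read their positions and font sizes could look like this:

pageText, _, _, err := ex.ExtractPageText()
if err != nil {
    return err
}
for _, mark := range pageText.Marks().Elements() {
    // Skip synthetic marks the extractor inserts (e.g. added spaces).
    if mark.Meta {
        continue
    }
    // Print the glyph text, its bounding box and the font size.
    fmt.Printf("%q at (%.1f, %.1f)-(%.1f, %.1f), font size %.1f\n",
        mark.Text, mark.BBox.Llx, mark.BBox.Lly, mark.BBox.Urx, mark.BBox.Ury,
        mark.FontSize)
}

Grouping marks by font size is one simple way to separate headings from body text, for example.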

Extracting Tables from PDF

The extractor package also provides high-level functionality for extracting tables from PDF files.  It is based on an algorithm that analyzes text positions, spacings, and rulings to automatically detect tables and collect their text.

The pdf_tables.go example in our GitHub repository illustrates how to use this functionality to automatically extract tables and output them to CSV files.

Note that this algorithm is more sophisticated than the one in pdf_to_csv.go, which simply attempts to divide the page into a grid (like a spreadsheet). The pdf_tables.go example automatically segments the page and can handle multiple tables on the same page.
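
For illustration, a rough sketch of consuming the detected tables and writing their cells out as CSV rows might look like the following (the TextTable fields W, H and Cells and the TableCell field Text are assumptions based on UniPDF v3; encoding/csv and os are from the standard library):

pageText, _, _, err := ex.ExtractPageText()
if err != nil {
    return err
}
w := csv.NewWriter(os.Stdout)
for _, table := range pageText.Tables() {
    // W is the number of columns and H the number of rows in the detected table.
    fmt.Printf("Detected table with %d columns and %d rows\n", table.W, table.H)
    for _, row := range table.Cells {
        record := make([]string, 0, len(row))
        for _, cell := range row {
            record = append(record, cell.Text)
        }
        if err := w.Write(record); err != nil {
            return err
        }
    }
}
w.Flush()

Because Tables() returns a slice, each detected table on the page is handled separately, which matches the multi-table behavior described above.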