How to read PDF files

First, a word of caution ✋

Extracting text from PDF files is not a simple operation. PDF was never meant to be a format to read data from: its purpose is to reproduce documents accurately and make them portable to any system. To complicate matters further, PDF files can be encrypted, the text in them can be "printed" into an image, tables have no standard format, and the order in which paragraphs appear on the page might not match their order in the underlying code of the PDF file.

If your automation can get its data from a source in a format other than PDF, we recommend using that instead.

If you have no alternative, however, not all hope is lost! In this article, we detail possible ways to get to the text contained in a PDF file.

Extracting text from PDF files

The RPA.PDF library includes tools to create and read the contents of PDF files.

Here is an example script that will extract the text in a PDF file and store it in a corresponding text file:

*** Settings ***
Documentation     An example robot to read the text contained in PDF Files
...               and save it to a corresponding text file.
Library           RPA.PDF
Library           RPA.FileSystem

*** Variables ***
${TXT_OUTPUT_DIRECTORY_PATH}=    ${CURDIR}${/}output${/}

*** Keywords ***
Extract text from PDF file into a text file
    [Arguments]    ${pdf_file_name}
    ${text}=    Get Text From Pdf    ${pdf_file_name}
    Create File    ${TXT_OUTPUT_DIRECTORY_PATH}${pdf_file_name}.txt
    FOR    ${page}    IN    @{text.keys()}
        Append To File    ${TXT_OUTPUT_DIRECTORY_PATH}${pdf_file_name}.txt
        ...    ${text[${page}]}
    END

*** Tasks ***
Extract text from PDF files
    Extract text from PDF file into a text file    example-file.pdf
    Extract text from PDF file into a text file    example-invoice.pdf

Robot script explained

*** Settings ***
Documentation     An example robot to read the text contained in PDF Files
...               and save it to a corresponding text file.
Library           RPA.PDF
Library           RPA.FileSystem

In the *** Settings *** section, we add a description of the robot, and we add the libraries we need:

  • The RPA.PDF library allows us to work with the PDF files.
  • The RPA.FileSystem library will be used to create the text files.

*** Variables ***
${TXT_OUTPUT_DIRECTORY_PATH}=    ${CURDIR}${/}output${/}

Here we are setting a variable to point to the directory where we want our text files to be created.
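
As a rough illustration of how this directory value is later combined with a file name, here is a minimal Python sketch. Both function names are hypothetical and not part of any RPA library. The first mirrors the plain concatenation used in the robot, so invoice.pdf becomes invoice.pdf.txt; the second is an optional variant that swaps the extension instead.

```python
from pathlib import Path


def output_path_for(output_dir: str, pdf_file_name: str) -> Path:
    """Mirror the robot: append ".txt" to the full PDF file name."""
    return Path(output_dir) / f"{pdf_file_name}.txt"


def output_path_replacing_suffix(output_dir: str, pdf_file_name: str) -> Path:
    """Optional variant (an assumption, not in the robot): swap ".pdf" for ".txt"."""
    return Path(output_dir) / Path(pdf_file_name).with_suffix(".txt").name


print(output_path_for("output", "invoice.pdf").name)               # invoice.pdf.txt
print(output_path_replacing_suffix("output", "invoice.pdf").name)  # invoice.txt
```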

*** Keywords ***
Extract text from PDF file into a text file
    [Arguments]    ${pdf_file_name}
    ${text}=    Get Text From Pdf    ${pdf_file_name}
    Create File    ${TXT_OUTPUT_DIRECTORY_PATH}${pdf_file_name}.txt
    FOR    ${page}    IN    @{text.keys()}
        Append To File    ${TXT_OUTPUT_DIRECTORY_PATH}${pdf_file_name}.txt
        ...    ${text[${page}]}
    END

Here we define our own keyword Extract text from PDF file into a text file:

  1. [Arguments] ${pdf_file_name}: The keyword gets the file name of a PDF file as an argument.
  2. ${text}= Get Text From Pdf ${pdf_file_name}: We extract the text from the PDF file using the Get Text From Pdf keyword provided by the RPA.PDF library. Because the keyword returns a dictionary with an item for each page of the PDF file, we need to go over the results with a FOR loop.
  3. Create File ${TXT_OUTPUT_DIRECTORY_PATH}${pdf_file_name}.txt: We create a new empty text file that we will fill with the text later.
  4. We loop over the pages of extracted text. For each of the pages, we call the Append To File keyword to add the text to our file.
    FOR    ${page}    IN    @{text.keys()}
        Append To File    ${TXT_OUTPUT_DIRECTORY_PATH}${pdf_file_name}.txt
        ...    ${text[${page}]}
    END
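
The same page-by-page loop can be sketched in plain Python. This is only an illustration under stated assumptions: write_pages_to_text_file is a hypothetical helper, and the pages dictionary is hand-written stand-in data shaped like the result described in step 2 (one item per page), not real output from the library.

```python
import tempfile
from pathlib import Path


def write_pages_to_text_file(pages: dict, output_path: Path) -> None:
    """Start from an empty text file, then append the text of each page in turn."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Start with an empty file (the role Create File plays in the robot).
    output_path.write_text("", encoding="utf-8")
    with output_path.open("a", encoding="utf-8") as handle:
        for page in pages.keys():      # roughly: FOR ${page} IN @{text.keys()}
            handle.write(pages[page])  # roughly: Append To File


# Stand-in data: page number mapped to the text found on that page.
sample_pages = {1: "First page\n", 2: "Second page\n"}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "output" / "invoice.pdf.txt"
    write_pages_to_text_file(sample_pages, target)
    print(target.read_text(encoding="utf-8"))
```

The pages end up in the file in the order the dictionary yields them, with nothing added between pages.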
    

Finally, in the *** Tasks *** section, we create a task and call the keyword for our PDF files:

*** Tasks ***
Extract text from PDF files
    Extract text from PDF file into a text file    example-file.pdf
    Extract text from PDF file into a text file    example-invoice.pdf

Results

The text extraction quality will vary greatly, depending on the document: how it was created, how it was formatted, etc.

Here are some examples:

Source PDF                 Resulting text file
simple-text-example.pdf    simple-text-example.txt
example-invoice.pdf        example-invoice.txt
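
Since quality varies this much, it can be worth checking the extracted text before relying on it. The following Python sketch flags pages that came back with no usable text, which may be a sign that the page holds an image rather than real text. pages_without_text is a hypothetical helper, and the sample dictionary is made-up data in the page-to-text shape described earlier.

```python
def pages_without_text(pages: dict) -> list:
    """Return the page keys whose extracted text is empty or only whitespace."""
    return [page for page, text in pages.items() if not (text or "").strip()]


# Made-up sample: page 1 has text, pages 2 and 3 have none.
sample = {1: "Invoice 123\nTotal: 10 EUR", 2: "   \n", 3: ""}
print(pages_without_text(sample))  # [2, 3]
```

An empty result does not prove the extraction is good; it only rules out the most obvious failure.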

Alternative approaches

The text extraction capabilities of the RPA.PDF library are currently quite limited. If you need higher precision or the ability to extract tables from invoices and reports, we recommend connecting your robot to an external extraction service.

You can see an example using the Amazon Textract service in our Cloud machine learning (ML) APIs, where we extract a table from a test invoice.