Processes Randomly Generated PDF Invoices With Amazon Textract

PDF - The most machine-readable document format ever! Right? 🙈

"Extracting text from PDF files is not a simple operation. PDF was never meant to be a format to read data from: its purpose is to provide an accurate way of reproducing documents and make them portable to any system." (From How to read PDF files with RPA Framework.)

Still, it is possible to automatically read and extract invoice data from PDF documents and save the data to an Excel file. No more manual copying and pasting!

This robot processes randomly generated PDF invoices with Amazon Textract and saves the extracted invoice data in an Excel file.

Example PDF invoice

Example Excel

Tasks

The robot provides three tasks:

  • Create Invoices
  • Process PDF invoices with Amazon Textract
  • Delete Files From Amazon S3 Bucket

Create Invoices

  • Generates random PDF invoices and uploads them to an Amazon S3 bucket.
  • Saves the generated PDF invoices to the output directory for debugging purposes.
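
As a rough illustration of the "random invoice" idea, the following Python sketch builds randomized but internally consistent invoice data. It is not the robot's own code: the field names (`invoice_number`, `items`, `total_cents`, and so on) are hypothetical, and rendering the data to PDF and uploading it to S3 are left out.

```python
import random
from datetime import date, timedelta


def random_invoice(rng: random.Random) -> dict:
    """Build one invoice with made-up values (all field names are hypothetical)."""
    items = []
    for _ in range(rng.randint(1, 5)):
        quantity = rng.randint(1, 10)
        unit_price_cents = rng.randint(100, 50_000)
        items.append({
            "description": f"Item {rng.randint(1000, 9999)}",
            "quantity": quantity,
            "unit_price_cents": unit_price_cents,
            "line_total_cents": quantity * unit_price_cents,
        })
    invoice_date = date(2021, 1, 1) + timedelta(days=rng.randint(0, 364))
    return {
        "invoice_number": f"INV-{rng.randint(0, 999_999):06d}",
        "invoice_date": invoice_date.isoformat(),
        "items": items,
        # Amounts are kept in integer cents to avoid floating-point rounding.
        "total_cents": sum(item["line_total_cents"] for item in items),
    }


# A seeded generator makes a run reproducible, which helps when debugging.
invoices = [random_invoice(random.Random(seed)) for seed in range(3)]
```

Keeping a known total in the generated data makes it easy to check later whether the extracted values match what was generated.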

Process PDF invoices with Amazon Textract

  • Reads the invoices from the Amazon S3 bucket.
  • Processes the invoices with Amazon Textract.
  • Saves the extracted invoice data in an Excel file in the output directory.
  • Finally, deletes the PDF invoices from the Amazon S3 bucket.
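
To give a feel for what "extracted invoice data" looks like, here is a minimal Python sketch that turns a Textract form-analysis response into a plain key/value dictionary. It assumes the response follows the `Blocks` structure that Textract returns for form data (`KEY_VALUE_SET` blocks linked to `WORD` blocks through `VALUE` and `CHILD` relationships). The function names and the sample response are illustrative, no AWS call is made, and the robot itself may rely on RPA Framework keywords instead.

```python
def _text_of(block: dict, blocks_by_id: dict) -> str:
    """Join the text of all WORD children of a block."""
    words = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response: dict) -> dict:
    """Map each detected form key to the text of its linked value block(s)."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    result = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_parts = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                for value_id in relationship["Ids"]:
                    value_parts.append(_text_of(blocks_by_id[value_id], blocks_by_id))
        result[_text_of(block, blocks_by_id)] = " ".join(p for p in value_parts if p)
    return result


# Hand-written sample in the shape of a Textract response (not real output).
sample_response = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "INV-000123"},
]}

print(extract_key_values(sample_response))  # {'Invoice Number:': 'INV-000123'}
```

A dictionary like this maps naturally to one row per invoice in the Excel file.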

Delete Files From Amazon S3 Bucket

  • A utility task for deleting the PDF invoices from the Amazon S3 bucket.
  • Can be executed separately when you want to empty the Amazon S3 bucket.
  • Called by the Process PDF invoices with Amazon Textract task in the teardown phase.
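
The cleanup-in-teardown pattern can be sketched in Python as follows. This is an illustration under assumptions rather than the robot's implementation: `empty_bucket` and `process_with_teardown` are hypothetical names, and `s3_client` is expected to behave like a boto3 S3 client (only `list_objects_v2` and `delete_objects` are used).

```python
def empty_bucket(s3_client, bucket: str) -> int:
    """Delete every object in the bucket and return how many were removed."""
    deleted = 0
    list_kwargs = {"Bucket": bucket}
    while True:
        page = s3_client.list_objects_v2(**list_kwargs)
        # "Contents" is absent when the page has no objects.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
        if not page.get("IsTruncated"):
            return deleted
        # More objects remain: continue listing from where this page ended.
        list_kwargs["ContinuationToken"] = page["NextContinuationToken"]


def process_with_teardown(s3_client, bucket: str, process):
    """Run process() and empty the bucket afterwards, even if it fails."""
    try:
        return process()
    finally:
        empty_bucket(s3_client, bucket)
```

With boto3 installed and credentials configured, a call could look like `empty_bucket(boto3.client("s3"), "my-invoice-bucket")`, where the bucket name is a placeholder. The `try`/`finally` mirrors the teardown phase: the bucket is emptied whether processing succeeds or raises.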

Prerequisites

Amazon API key and key ID with access to Amazon S3 and Amazon Textract

The robot requires access to the Amazon S3 and Amazon Textract services. It needs an API key, a key ID, and the AWS region. See the Amazon Textract Developer Guide for details.

Store the API key, key ID, and the AWS region in Robocorp Vault

Set up Robocorp Vault either locally or in Robocorp Cloud.

For a local run, use the following configuration:

/Users/username/vault.json:

{
  "aws": {
    "AWS_KEY": "aws-key",
    "AWS_KEY_ID": "aws-key-id",
    "AWS_REGION": "us-east-1"
  }
}

devdata/env.json:

{
  "RPA_SECRET_MANAGER": "RPA.Robocloud.Secrets.FileSecrets",
  "RPA_SECRET_FILE": "/Users/username/vault.json"
}
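
For a quick sanity check of the local vault file, a short Python sketch like the one below can load it and verify that the three expected keys are present. The helper names are hypothetical. The mapping to boto3 client arguments assumes that AWS_KEY_ID holds the access key ID and AWS_KEY holds the secret access key. The robot reads these secrets through Robocorp Vault, not through this code.

```python
import json

REQUIRED_KEYS = ("AWS_KEY", "AWS_KEY_ID", "AWS_REGION")


def load_aws_secret(vault_path: str, name: str = "aws") -> dict:
    """Read one named secret from a local vault JSON file and validate it."""
    with open(vault_path, encoding="utf-8") as vault_file:
        vault = json.load(vault_file)
    if name not in vault:
        raise KeyError(f"secret {name!r} not found in {vault_path}")
    secret = vault[name]
    missing = [key for key in REQUIRED_KEYS if not secret.get(key)]
    if missing:
        raise ValueError(f"secret {name!r} is missing: {', '.join(missing)}")
    return secret


def boto3_client_kwargs(secret: dict) -> dict:
    """Translate the vault entry into keyword arguments for boto3.client().

    Assumption: AWS_KEY_ID is the access key ID, AWS_KEY the secret access key.
    """
    return {
        "aws_access_key_id": secret["AWS_KEY_ID"],
        "aws_secret_access_key": secret["AWS_KEY"],
        "region_name": secret["AWS_REGION"],
    }
```

With boto3 installed, the result could be used as `boto3.client("textract", **boto3_client_kwargs(load_aws_secret("/Users/username/vault.json")))`.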

For a Robocorp Cloud run, create a new vault entry in Robocorp Cloud.

  • Enter aws as the name.
  • Provide values for the AWS_KEY, AWS_KEY_ID, and AWS_REGION keys.

Running

  1. Run the Create Invoices task to create the PDF invoices.
  2. Run the Process PDF invoices with Amazon Textract task to process the PDF invoices and to generate the Excel file with the data extracted from the invoices.

Optional: Run the Delete Files From Amazon S3 Bucket task if you want to delete the PDF invoices from the Amazon S3 bucket (the Process PDF invoices with Amazon Textract task does this automatically in the teardown phase).

When running in Robocorp Cloud, add the Create Invoices and Process PDF invoices with Amazon Textract tasks as process steps.

Further reading

Discover more robots