Handling PDF files
The Portable Document Format (PDF) is ubiquitous, and countless business processes rely on PDF files for reports, invoices, and a variety of other documents. Manipulating PDF files is therefore an important skill for software robot developers.
With the Robocorp stack, PDF operations are performed using the RPA.PDF library, part of RPA Framework.
⚠️ Keep in mind that this library works with text-based PDFs; it can't extract information from an image-based (scanned) PDF file. To get accurate results from scanned documents, you have to use specialized external services, which are wrapped by the RPA.DocumentAI library.
Using the keywords provided by the RPA.PDF library, you can create PDF files in multiple ways:
- Creating PDF files starting from an HTML template: This method lets you create PDF files based on an HTML template and a set of data. For an example, check out the PDF invites creator robot example.
- Converting HTML content into a PDF file: This can be achieved as a special case of the above. To see this approach at work, you can check the PDF creation chapter of the Beginners' course.
- Creating the PDF file from scratch: The RPA.PDF library also includes the fpdf2 Python library to enable more advanced and fine-tuned ways of creating PDF files. Refer to the fpdf2 documentation for more information about the usage and the options available.
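As a rough illustration of the HTML-based approaches listed above, the following Python sketch fills a small HTML template with data and then hands the result to the `html_to_pdf` method of RPA.PDF. The template text, the placeholder names `$name` and `$event`, and the helpers `render_invite` and `save_invite_pdf` are assumptions made up for this example; the templating here uses Python's standard `string.Template` rather than the library's own template keyword.

```python
from html import escape
from string import Template

# Hypothetical invite template; $name and $event are placeholder
# names chosen for this sketch, not part of RPA.PDF.
INVITE_TEMPLATE = Template(
    "<html><body><h1>Invitation</h1>"
    "<p>Dear $name, you are invited to $event.</p>"
    "</body></html>"
)


def render_invite(name: str, event: str) -> str:
    """Fill the template, escaping values so they cannot break the markup."""
    return INVITE_TEMPLATE.substitute(name=escape(name), event=escape(event))


def save_invite_pdf(name: str, event: str, output_path: str) -> None:
    """Render the HTML and convert it to a PDF file with RPA.PDF."""
    # Imported lazily so the rendering step can be tried out
    # even where rpaframework is not installed.
    from RPA.PDF import PDF

    PDF().html_to_pdf(render_invite(name, event), output_path)
```

Calling `save_invite_pdf("Ada", "RoboCon", "output/invite.pdf")` would be one way to use it, assuming the `rpaframework` package is installed and the output directory is writable.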
PDF files can contain forms that users can fill using a desktop program like Acrobat Reader or Preview on macOS. Using the RPA.PDF library, you can automate this operation. See how in the how to fill PDF forms article.
Extracting text and data from PDF files is not a simple operation, mostly because this was not the intended use case for the PDF file format. If possible, avoid using PDF files as a source of data. If you absolutely must (😀), you can see a possible approach in the how to read PDF files article.
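To give a feel for what such extraction can look like, here is a minimal Python sketch. It assumes that `get_text_from_pdf` in RPA.PDF returns a dictionary of page text keyed by page number, and it searches that text with a regular expression. The `Invoice No.` pattern and the helpers `find_invoice_number` and `read_invoice_number` are hypothetical; a real document will need its own pattern.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for this sketch: matches text such as
# "Invoice No. INV-001". Adjust it to the layout of your documents.
INVOICE_RE = re.compile(r"Invoice\s+No\.\s*([\w-]+)")


def find_invoice_number(pages: Dict[int, str]) -> Optional[str]:
    """Return the first invoice number found, scanning pages in order."""
    for page_number in sorted(pages):
        match = INVOICE_RE.search(pages[page_number])
        if match:
            return match.group(1)
    return None


def read_invoice_number(pdf_path: str) -> Optional[str]:
    """Extract the text of a text-based PDF and look for an invoice number."""
    # Imported lazily so the parsing step can be tried out
    # even where rpaframework is not installed.
    from RPA.PDF import PDF

    # Assumption: the result maps page numbers to the text of each page.
    pages = PDF().get_text_from_pdf(pdf_path)
    return find_invoice_number(pages)
```

Keeping the parsing separate from the extraction makes the fragile part, the pattern matching, easy to check against sample text before running it on real files.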