RPA.PDF
Add images and/or pdfs to new PDF document.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
files | list, None | None | list of filepaths to add into PDF (can be either images or PDFs) |
target_document | str, None | None | filepath of target PDF |
append | bool | False | appends files to existing document if append is True |
Supports merging and splitting PDFs.
Image formats supported are JPEG, PNG and GIF.
A file can be given extra properties by appending a colon (:) followed by the properties to the end of the filename. Separate multiple properties with commas.
Supported extra properties for PDFs are:
- page and/or page ranges
- no extras means that all source PDF pages are added into new PDF
Supported extra properties for images are:
- format, the PDF page format, for example Letter or A4
- rotate, how many degrees image is rotated counter-clockwise
- align, only possible value at the moment is center
- orientation, the PDF page orientation for the image, possible values P (portrait) or L (landscape)
- x/y, coordinates for adjusting image position on the page
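The colon-and-comma syntax described above can be illustrated with a small parser. This is a minimal sketch and not the library's own implementation: the helper name `split_file_spec` is hypothetical, and the sample property strings (`2-10,15`, `align=center`) are assumed forms, since the text only states that properties follow a colon and are separated by commas.

```python
from typing import List, Tuple


def split_file_spec(spec: str) -> Tuple[str, List[str]]:
    """Split 'path:prop1,prop2' into the path and a list of properties.

    Hypothetical helper, for illustration only.
    """
    path, sep, tail = spec.rpartition(":")
    # No colon at all, or the last colon belongs to the path itself
    # (for example a Windows drive letter), so there are no properties.
    if not sep or "/" in tail or "\\" in tail:
        return spec, []
    props = [p.strip() for p in tail.split(",") if p.strip()]
    return path, props


print(split_file_spec("invoice.pdf:2-10,15"))        # ('invoice.pdf', ['2-10', '15'])
print(split_file_spec("approved.png:align=center"))  # ('approved.png', ['align=center'])
print(split_file_spec("plain.pdf"))                  # ('plain.pdf', [])
```

Splitting on the last colon rather than the first keeps drive-letter paths intact when properties are present.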
Add an image into an existing or new PDF.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
image_path | str, Path | null | filepath to image file to add into PDF |
output_path | str, Path | null | filepath of target PDF |
source_path | str, Path, None | None | filepath to source, if not given add image to currently active PDF |
coverage | float | 0.2 | how the watermark image should be scaled on page, defaults to 0.2 |
If no source path is given, a PDF is assumed to be already open.
Close all opened PDF file descriptors.
Close PDF file descriptor for a certain file.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_pdf | str, None | None | filepath to the source pdf. |
Raises ValueError if the file descriptor for the file is not found.
Parse source PDF into entities.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | source PDF filepath |
trim | bool | True | trim whitespace from the text if set to True (default) |
pagenum | str, int, None | None | Page number on which the search is performed. Defaults to None, meaning all pages get converted (numbers start from 1) |
These entities can be used for text searches or XML dumping for example. The conversion will be done automatically when using the dependent keywords directly.
Decrypt PDF with password.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str | null | filepath to the source pdf. |
output_path | str | null | filepath to the decrypted pdf. |
password | str | null | password as a string. |
If no source path given, assumes a PDF is already opened.
Returns True if decryption was successful, else False or an exception.
Raises ValueError on decryption errors.
Get a PDFMiner-format XML dump of the PDF.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | filepath to the source PDF |
Returns the XML content as a string.
Encrypt a PDF document.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | filepath to the source pdf. |
output_path | str, None | None | filepath to the target pdf, stored by default in the robot output directory as output.pdf |
user_pwd | str | | allows opening and reading PDF with restrictions. |
owner_pwd | str, None | None | allows opening PDF without any restrictions, by default the same as user_pwd. |
use_128bit | bool | True | whether to use 128-bit encryption; when False, 40-bit encryption is used. Default True. |
If no source path given, assumes a PDF is already opened.
Extract pages from source PDF and save to a new PDF document.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | filepath to the source pdf. |
output_path | str, None | None | filepath to the target pdf, stored by default in the robot output directory as output.pdf |
pages | int, str, List[int], List[str], None | None | page numbers to extract from PDF (numbers start from 1); if None, all pages are extracted. |
Page numbers start from 1.
If no source path given, assumes a PDF is already opened.
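Because the pages argument accepts several types, it can help to see them reduced to one form. The following is a minimal sketch under stated assumptions: `normalize_pages` is a hypothetical helper, and the string syntax it accepts (comma-separated numbers and dash ranges such as `1-3,5`) is an assumption rather than something the table above confirms.

```python
from typing import List, Union

PagesArg = Union[int, str, List[int], List[str], None]


def normalize_pages(pages: PagesArg, total: int) -> List[int]:
    """Return sorted, unique 1-based page numbers within 1..total."""
    if pages is None:
        return list(range(1, total + 1))  # None means all pages
    items = pages if isinstance(pages, list) else [pages]
    result = set()
    for item in items:
        for part in str(item).split(","):
            part = part.strip()
            if not part:
                continue
            if "-" in part:
                # Assumed range syntax: "start-end", both ends inclusive.
                start, end = (int(x) for x in part.split("-", 1))
            else:
                start = end = int(part)
            result.update(range(start, end + 1))
    bad = sorted(p for p in result if p < 1 or p > total)
    if bad:
        raise ValueError(f"page numbers out of range 1..{total}: {bad}")
    return sorted(result)


print(normalize_pages("1-3,5", total=6))  # [1, 2, 3, 5]
print(normalize_pages([2, "4"], total=5))  # [2, 4]
```

Code that indexes pages from zero would subtract 1 from each returned number.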
Find the closest text elements near the set anchor(s) through locator.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
locator | str | null | Element to set anchor to. This can be prefixed with either "text:", "subtext:", "regex:" or "coords:" to find the anchor by text or coordinates. The "text" strategy is assumed if no such prefix is specified. (text search is case-sensitive; use ignore_case param for controlling it) |
pagenum | int, str | 1 | Page number where search is performed on, defaults to 1 (first page). |
direction | str | right | In which direction to search for text elements. This can be any of 'top'/'up', 'bottom'/'down', 'left' or 'right'. (defaults to 'right') |
closest_neighbours | str, int, None | 1 | How many neighbours to return at most, sorted by the distance from the current anchor. |
strict | bool | False | Whether the element's margins should be used to match only those elements which are aligned to the anchor. (turned off by default) |
regexp | str, None | None | Expected format of the searched text value. By default all the candidates in range are considered valid neighbours. |
trim | bool | True | Automatically trim leading/trailing whitespace from the text elements. (switched on by default) |
ignore_case | bool | False | Do a case-insensitive search when set to True. (affects the passed locator and regexp filtering) |
The PDF will be parsed automatically before elements can be searched.
Returns a list of Match objects where every match has the following attributes: .anchor - the matched text with the locator; .neighbours - a list of adjacent texts found on the specified direction
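The locator prefixes can be illustrated with a short parser. This is a sketch only: `parse_locator` and `text_matches` are hypothetical helpers, and the matching semantics (exact match for "text", containment for "subtext", `re.search` for "regex") are assumptions that the description does not spell out.

```python
import re
from typing import Tuple

STRATEGIES = ("text", "subtext", "regex", "coords")


def parse_locator(locator: str) -> Tuple[str, str]:
    """Split a locator into (strategy, value); 'text' is the default."""
    prefix, sep, rest = locator.partition(":")
    if sep and prefix in STRATEGIES:
        return prefix, rest
    return "text", locator


def text_matches(locator: str, candidate: str, ignore_case: bool = False) -> bool:
    """Assumed semantics: text = exact, subtext = contains, regex = search."""
    strategy, value = parse_locator(locator)
    if strategy == "coords":
        raise ValueError("coordinate locators are not text matches")
    if strategy == "regex":
        flags = re.IGNORECASE if ignore_case else 0
        return re.search(value, candidate, flags) is not None
    if ignore_case:
        value, candidate = value.lower(), candidate.lower()
    return value == candidate if strategy == "text" else value in candidate


print(parse_locator("Invoice Number"))          # ('text', 'Invoice Number')
print(parse_locator("regex:.*Order.*"))         # ('regex', '.*Order.*')
print(text_matches("subtext:Invoice", "Invoice Number"))  # True
```

Only the four known prefixes are treated as strategies, so a plain text such as "Total: 5" is still searched as text.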
Attention!
Keep in mind that this keyword works with text-based PDFs, and it can't extract information from an image-based (scan) PDF file. For accurate results, you have to use specialized external services wrapped by the RPA.DocumentAI library.
Portal example with video recording demo for parsing PDF invoices: https://github.com/robocorp/example-parse-pdf-invoice
Return all figures in the PDF document.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | filepath to the source pdf. |
If no source path given, assumes a PDF is already opened.
Returns a dictionary of figures divided into pages.
Get input fields in the PDF.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | Filepath to source, if not given use the currently active PDF. |
replace_none_value | bool | False | Enable this to conveniently visualize the fields. (replaces the null value with field's default or its name if absent) |
encoding | str, None | iso-8859-1 | Use an explicit encoding for field name/value parsing. (defaults to "iso-8859-1" but "utf-8/16" might be the one working for you) |
Stores input fields internally so that they can be used without parsing the PDF again.
Returns a dictionary with all the found fields. Use their key names when setting values into them.
Raises KeyError if no input fields are enabled in the PDF.
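The encoding argument matters because the same bytes decode to different text under different codecs. The snippet below is a standalone illustration using only the Python standard library; the sample bytes are made up, and whether a given PDF stores its field text as UTF-8 is an assumption.

```python
# Bytes as they might be stored for a field value (made-up sample).
raw = "Café".encode("utf-8")

as_latin1 = raw.decode("iso-8859-1")  # decoded with the default codec
as_utf8 = raw.decode("utf-8")         # decoded with an explicit codec

print(as_latin1)  # CafÃ©
print(as_utf8)    # Café
```

Garbled characters of this kind in field names or values are a hint that a different encoding may be worth trying.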
Get number of pages in the document.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | filepath to the source pdf |
If no source path given, assumes a PDF is already opened.
Raises PdfReadError if the file is encrypted or other restrictions are in place.
Get metadata from a PDF document.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | filepath to the source PDF. |
If no source path given, assumes a PDF is already opened.
Returns a dictionary of PDF information.
Get text from set of pages in source PDF document.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | filepath to the source pdf. |
pages | int, str, List[int], List[str], None | None | page numbers to get text (numbers start from 1). |
details | bool | False | set to True to return textboxes, default False. |
trim | bool | True | set to False to return raw texts, default True means whitespace is trimmed from the text |
If no source path given, assumes a PDF is already opened.
Returns a dictionary of pages and their texts.
Generate a PDF file from HTML content.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
content | str | null | HTML content. |
output_path | str | null | Filepath where to save the PDF document. |
encoding | str | utf-8 | Codec used for text I/O. |
Note that input must be well-formed and valid HTML.
Check if PDF is encrypted.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | filepath to the source pdf. |
If no source path given, assumes a PDF is already opened.
Returns True if the file is encrypted.
Open a PDF document for reading.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, Path | null | filepath to the source pdf. |
This is called automatically in the other PDF keywords when a path to the PDF file is given as an argument.
Raises ValueError if the PDF is already open.
Rotate pages in source PDF document and save to target PDF document.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
pages | int, str, List[int], List[str], None | null | page numbers to rotate (numbers start from 1). |
source_path | str, None | None | filepath to the source pdf. |
output_path | str, None | None | filepath to the target pdf, stored by default in the robot output directory as output.pdf |
clockwise | bool | True | direction in which the page will be rotated; clockwise when True (default). |
angle | int | 90 | number of degrees to rotate, default 90. |
If no source path given, assumes a PDF is already opened.
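How the clockwise and angle arguments combine can be sketched as simple modular arithmetic. This is an illustration, not the library's code: `resulting_rotation` is a hypothetical helper, and the multiple-of-90 check reflects how page rotation is stored in PDF files rather than any validation the text above documents.

```python
def resulting_rotation(current: int, angle: int = 90, clockwise: bool = True) -> int:
    """Return the page rotation in degrees (0-359) after applying a rotation."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    # Counter-clockwise rotation is a negative clockwise rotation.
    delta = angle if clockwise else -angle
    return (current + delta) % 360


print(resulting_rotation(0))                       # 90
print(resulting_rotation(0, 90, clockwise=False))  # 270
print(resulting_rotation(270, 180))                # 90
```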
Save field values in PDF if it has fields.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | Source PDF with fields to update. |
output_path | str, None | None | Updated target PDF. |
newvals | dict, None | None | New values when updating many at once. |
use_appearances_writer | bool | False | For some PDF documents the updated fields won't be visible (or will look strange). When this happens, try setting this to True so the previewer re-renders them based on the actual values. (viewing the output PDF in a browser might then display the field values correctly) |
Try to save the image data from Figure object, and return the file name, if successful.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
figure | Figure | null | PDF Figure object which will be saved as an image. The PDF Figure object can be determined from the Get All Figures keyword |
images_folder | str | . | directory where image files will be created |
file_prefix | str | | image filename prefix |
The figure needs to have a byte stream, and that stream must be recognized as an image format for the save to succeed.
Returns the image filepath or None.
Save figures from given PDF document as image files.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, None | None | filepath to PDF document |
images_folder | str | . | directory where image files will be created |
pages | str, None | None | target figures in the pages, can be single page or range, default None means that all pages are scanned for figures to save (numbers start from 1) |
file_prefix | str | | image filename prefix |
If no source path given, assumes a PDF is already opened.
Returns a list of the image filenames created.
Save the contents of a pypdf reader to a new file.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
output_path | str | null | filepath to target PDF |
reader | PdfReader, None | None | a pypdf reader (defaults to currently active document) |
Sets main anchor point in the document for further searches.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
locator | str | null | Element to set anchor to. This can be prefixed with either "text:", "subtext:", "regex:" or "coords:" to find the anchor by text or coordinates. The "text" strategy is assumed if no such prefix is specified. (text search is case-sensitive; use ignore_case param for controlling it) |
trim | bool | True | Automatically trim leading/trailing whitespace from the text elements. (switched on by default) |
pagenum | int, str | 1 | Page number where search is performed on, defaults to 1 (first page). |
ignore_case | bool | False | Do a case-insensitive search when set to True. |
This is used internally in the library and can work with multiple anchors at the same time if such are found.
Returns True if at least one anchor was found.
Change settings for PDFMiner document conversion.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
line_margin | float, None | None | relative margin between bounding lines, default 0.5 |
word_margin | float, None | None | relative margin between words, default 0.1 |
char_margin | float, None | None | relative margin between characters, default 2.0 |
boxes_flow | float, None | 0.5 | positioning of the text boxes based on text, default 0.5 |
line_margin controls how textboxes are grouped. If conversion results in texts grouped into one group, set this to a lower value.
word_margin controls how spaces are inserted between words. If conversion results in text without spaces, set this to a lower value.
char_margin controls how characters are grouped into words. If conversion results in individual characters instead of words, set this to a higher value.
boxes_flow controls how much the horizontal and vertical position of the text matters when determining the order of text boxes. The value can range from -1.0 (only horizontal position matters) to +1.0 (only vertical position matters). This feature (advanced layout analysis) can be disabled by setting the value to None, in which case the bottom-left corner of each text box is used to determine the order of the text boxes.
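The documented defaults and the boxes_flow range can be captured in a small settings object. This is a minimal sketch: `ConvertSettings` is a hypothetical class that only mirrors the four arguments and the range described above, and it does not call PDFMiner or the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConvertSettings:
    """Hypothetical container mirroring the four arguments above."""

    line_margin: float = 0.5
    word_margin: float = 0.1
    char_margin: float = 2.0
    boxes_flow: Optional[float] = 0.5  # None disables advanced layout analysis

    def __post_init__(self) -> None:
        # Enforce the documented range of -1.0 to +1.0 (or None).
        if self.boxes_flow is not None and not -1.0 <= self.boxes_flow <= 1.0:
            raise ValueError("boxes_flow must be within -1.0..1.0 or None")


# Text came out as single characters, so raise char_margin;
# rely on vertical position only when ordering text boxes.
settings = ConvertSettings(char_margin=4.0, boxes_flow=1.0)
print(settings)
```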
Set value for field with given name on the active document.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
field_name | str | null | Field to update. |
value | Any | null | New value for the field. |
source_path | str, None | None | Source PDF file path. |
Tries to match with field's identifier directly or its label. When ticking checkboxes, try with the /Yes string value as simply Yes might not work with most previewing apps.
Raises ValueError when the field can't be found or more than one field matches the given field_name.
Switch the library's current file object to an already opened file, or open the file if it is not open yet.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
source_path | str, Path, None | None | filepath to the source pdf. |
This is done automatically in the PDF library keywords.
Raises ValueError if no PDF filepath is given and there is no active file to activate.
Use HTML template file to generate PDF file.
Arguments
Argument | Type | Default value | Description |
---|---|---|---|
template | str | null | Filepath to the HTML template. |
output_path | str | null | Filepath where to save PDF document. |
variables | dict, None | None | Dictionary of variables to fill into template, defaults to {}. |
encoding | str | utf-8 | Codec used for text I/O. |
It provides an easy method of generating a PDF document from an HTML formatted template file.
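Filling a template with a dictionary of variables can be sketched as plain string substitution. This is an illustration under stated assumptions: `fill_template` is a hypothetical helper, and the `{{name}}` placeholder style is assumed, since the text above does not say how variables are marked in the template.

```python
from typing import Any, Dict, Optional


def fill_template(html: str, variables: Optional[Dict[str, Any]] = None) -> str:
    """Replace {{name}} placeholders with values (assumed placeholder style)."""
    for name, value in (variables or {}).items():
        html = html.replace("{{" + name + "}}", str(value))
    return html


template = "<h1>Order for {{name}}</h1><p>Items: {{count}}</p>"
print(fill_template(template, {"name": "Robocorp", "count": 3}))
# <h1>Order for Robocorp</h1><p>Items: 3</p>
```

Passing no variables returns the template unchanged, which matches the documented default of an empty dictionary.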