Simple Web Scraper
Opens a web page and stores some content.
This example robot stores the web page text in the "output" directory and embeds an image screenshot in the log.
When run, the robot will:
- open a real web browser
- hide distracting UI elements
- scroll down to load dynamic content
- collect the latest tweets by a given Twitter user
- create a file system directory by the name of the Twitter user
- store the text content of each tweet in separate files in the directory
- store a screenshot of each tweet in the directory
Because Twitter blocks requests coming from the cloud, this robot can only be executed on a local machine or triggered from Control Room using Robocorp Assistant.
Robot script explained
After a run, the robot should have created a directory, output/tweets/RobocorpInc, containing images (screenshots of the tweets) and text files (the texts of the tweets).
Tasks
The main robot file (tasks.robot) contains the tasks your robot is going to complete when run:
- *** Tasks *** is the section header.
- Store the latest tweets by given user name is the name of the task.
- Open Twitter homepage, etc., are keyword calls.
- [Teardown] Close Browser ensures that the browser will be closed even if the robot fails to accomplish its task.
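Put together, the Tasks section might look roughly like the sketch below. The task name, the first keyword call, and the teardown come from the description above. The remaining keyword calls and their order are an assumption based on the keywords explained later in this page, and all of them must be defined in the Keywords section for the file to run.

```robotframework
*** Tasks ***
Store the latest tweets by given user name
    # Each indented line calls a keyword defined in the Keywords section.
    Open Twitter homepage
    Hide distracting UI elements
    Scroll down to load dynamic content
    Store the tweets
    # The teardown runs even if one of the calls above fails.
    [Teardown]    Close Browser
```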
Settings
The *** Settings *** section provides short Documentation for the script and imports libraries (Library) that add new keywords for the robot to use. Libraries contain Python code for things like commanding a web browser (RPA.Browser.Selenium) or creating file system directories and files (RPA.FileSystem).
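As a rough illustration, a Settings section that imports the four libraries this robot relies on could look like the following. The documentation string is taken from the robot's description; the exact wording and layout in the real file may differ.

```robotframework
*** Settings ***
Documentation     Opens a web page and stores some content.
# Collections provides list keywords such as Get Slice From List.
Library           Collections
# Browser automation keywords such as Open Available Browser.
Library           RPA.Browser.Selenium
# File and directory keywords such as Create Directory and Create File.
Library           RPA.FileSystem
# Provides the Mute Run On Failure keyword.
Library           RPA.RobotLogListener
```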
Variables
Variables provide a way to change the input values for the robot in one place. This robot provides variables for the Twitter user name, the number of tweets to collect, the file system directory path for storing the tweets, and a locator for finding the tweet HTML elements from the Twitter web page.
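A minimal sketch of such a Variables section is shown below. Only ${USER_NAME} and its value, and the output/tweets/RobocorpInc directory layout, come from this page. The other variable names, the tweet count, and the locator are assumptions made for illustration.

```robotframework
*** Variables ***
# Twitter user whose tweets are collected.
${USER_NAME}            RobocorpInc
# Assumed count; the actual number is not stated here.
${NUMBER_OF_TWEETS}     3
# ${CURDIR} is the directory of this file; ${/} is the OS path separator.
${TWEET_DIRECTORY}      ${CURDIR}${/}output${/}tweets${/}${USER_NAME}
# Hypothetical locator; the real one depends on Twitter's current markup.
${TWEETS_LOCATOR}       css:article
```

Changing ${USER_NAME} in this one place would change both the page that is opened and the directory the tweets are stored in.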
Keywords
The Keywords section defines the implementation of the actual things the robot will do.
The Open Twitter homepage keyword uses the Open Available Browser keyword from the RPA.Browser.Selenium library to open a browser. It takes one required argument, the URL to open, which here contains the ${USER_NAME} variable.
The ${USER_NAME} variable is defined in the *** Variables *** section. Its value is RobocorpInc. When the robot is executed, Robot Framework replaces the variable with its value, and the URL becomes https://mobile.twitter.com/RobocorpInc.
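The keyword and the variable it depends on can be sketched as follows. The URL pattern and the variable value are the ones described above; everything else the real keyword may do (for example, waiting for the page to render) is left out.

```robotframework
*** Settings ***
Library    RPA.Browser.Selenium

*** Variables ***
${USER_NAME}    RobocorpInc

*** Keywords ***
Open Twitter homepage
    # ${USER_NAME} is replaced with RobocorpInc before the browser opens.
    Open Available Browser    https://mobile.twitter.com/${USER_NAME}
```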
The Hide element keyword takes care of getting rid of unnecessary elements from the web page when taking screenshots. The Mute Run On Failure keyword from the RPA.RobotLogListener library prevents the robot from saving a screenshot in case of failure (the default behavior on failure) when executing the Execute Javascript keyword. In this case, we are not really interested in these failures, so we decided to mute the failure behavior.
The Execute Javascript keyword executes the given JavaScript in the browser. The JavaScript expression contains a variable (${locator}) that is passed in as an argument to the Hide element keyword.
The Hide distracting UI elements keyword calls the Hide element keyword with locators pointing to all the elements we want the robot to hide from the web page. A for loop is used to loop through the list of locators.
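The two hiding keywords could be sketched along these lines. The JavaScript expression, the assumption that locators are CSS selectors, and the two example selectors are illustrative guesses, not the robot's actual values. Run Keyword And Ignore Error is used so that a selector matching nothing does not stop the run.

```robotframework
*** Settings ***
Library    RPA.Browser.Selenium
Library    RPA.RobotLogListener

*** Keywords ***
Hide element
    [Arguments]    ${locator}
    # Do not save a failure screenshot if the JavaScript call fails.
    Mute Run On Failure    Execute Javascript
    # Assumed JavaScript: hide the first element matching the CSS selector.
    Run Keyword And Ignore Error
    ...    Execute Javascript
    ...    document.querySelector('${locator}').style.display = 'none'

Hide distracting UI elements
    # Hypothetical selectors; replace with the elements to hide.
    @{locators}=    Create List    header    nav
    FOR    ${locator}    IN    @{locators}
        Hide element    ${locator}
    END
```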
The Scroll down to load dynamic content keyword ensures that the dynamic content is loaded before the robot tries to store the tweets. It scrolls down the browser window, starting from 200 pixels from the top of the page, until 2000 pixels down, in 200-pixel steps. The Sleep keyword provides some time for the dynamic content to load. Finally, the web page is scrolled back to the top.
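One way to express this scrolling loop is sketched below. The use of window.scrollTo with absolute positions and the 500 ms pause are assumptions; the real keyword may scroll or wait differently.

```robotframework
*** Settings ***
Library    RPA.Browser.Selenium

*** Keywords ***
Scroll down to load dynamic content
    # IN RANGE excludes the end value, so 2001 makes the last step 2000.
    FOR    ${pixels}    IN RANGE    200    2001    200
        # Scroll to an absolute vertical position (assumed approach).
        Execute Javascript    window.scrollTo(0, ${pixels})
        # Assumed pause; tune it to how fast the page loads content.
        Sleep    500ms
    END
    # Return to the top of the page.
    Execute Javascript    window.scrollTo(0, 0)
```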
The Get tweets keyword collects the tweet HTML elements from the web page using the Get WebElements keyword. The Get Slice From List keyword is used to limit the number of elements before returning them using the RETURN keyword.
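A sketch of this keyword, under the assumption that the tweet count and locator live in variables named ${NUMBER_OF_TWEETS} and ${TWEETS_LOCATOR} (both names and values are hypothetical):

```robotframework
*** Settings ***
Library    Collections
Library    RPA.Browser.Selenium

*** Variables ***
# Both values are placeholders for illustration.
${NUMBER_OF_TWEETS}    3
${TWEETS_LOCATOR}      css:article

*** Keywords ***
Get tweets
    # Collect every element that matches the locator.
    @{all_tweets}=    Get WebElements    ${TWEETS_LOCATOR}
    # Keep only the elements from index 0 up to the configured count.
    @{tweets}=    Get Slice From List    ${all_tweets}    0    ${NUMBER_OF_TWEETS}
    # RETURN requires Robot Framework 5.0 or newer.
    RETURN    ${tweets}
```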
The Store the tweets keyword stores the text and the screenshot of each tweet. It uses the Create Directory keyword to create a file system directory, the Set Variable keyword to create local variables, the Capture Element Screenshot keyword to take the screenshots, the Create File keyword to create the files, and the Evaluate keyword to evaluate a Python expression.
The file paths are constructed dynamically using variables.
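The storing step might be sketched as follows. The variable names, the tweet-N file name pattern, the use of Get Text to read the tweet text, and the inlined Get WebElements call (used so the sketch stands alone) are all assumptions rather than the robot's actual implementation.

```robotframework
*** Settings ***
Library    RPA.Browser.Selenium
Library    RPA.FileSystem

*** Variables ***
# Placeholder values for illustration.
${TWEET_DIRECTORY}    ${CURDIR}${/}output${/}tweets${/}RobocorpInc
${TWEETS_LOCATOR}     css:article

*** Keywords ***
Store the tweets
    Create Directory    ${TWEET_DIRECTORY}    parents=True
    ${index}=    Set Variable    1
    # Inlined lookup so the sketch stands alone.
    @{tweets}=    Get WebElements    ${TWEETS_LOCATOR}
    FOR    ${tweet}    IN    @{tweets}
        # Build per-tweet file paths from the directory and the index.
        ${screenshot_file}=    Set Variable    ${TWEET_DIRECTORY}${/}tweet-${index}.png
        ${text_file}=    Set Variable    ${TWEET_DIRECTORY}${/}tweet-${index}.txt
        # Get Text is an assumed way to read the element's text.
        ${text}=    Get Text    ${tweet}
        Capture Element Screenshot    ${tweet}    ${screenshot_file}
        Create File    ${text_file}    ${text}    overwrite=True
        # Evaluate runs a Python expression to increment the counter.
        ${index}=    Evaluate    ${index} + 1
    END
```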
Summary
You executed a web scraper robot, congratulations!
During the process, you learned some concepts and features of Robot Framework and some good practices:
- Defining Settings for your script (*** Settings ***)
- Documenting scripts (Documentation)
- Importing libraries (Collections, RPA.Browser.Selenium, RPA.FileSystem, RPA.RobotLogListener)
- Using keywords provided by libraries (Open Available Browser)
- Creating your own keywords
- Defining arguments ([Arguments])
- Calling keywords with arguments
- Returning values from keywords (RETURN)
- Using predefined variables (${CURDIR})
- Using your own variables
- Creating loops with Robot Framework syntax
- Running teardown steps ([Teardown])
- Opening a real browser
- Navigating to web pages
- Locating web elements
- Hiding web elements
- Executing JavaScript code
- Scraping text from web elements
- Taking screenshots of web elements
- Creating file system directories
- Creating and writing to files
- Ignoring errors when it makes sense (Run Keyword And Ignore Error)
Technical information
Last updated: 1 November 2023
License: Apache License 2.0