Simple Web Scraper

Opens a web page and stores some content.

Robocorp

An example robot that opens a web page and stores some of its content. The web page text is stored in the "output" directory, and a screenshot is embedded in the log.

When run, the robot will:

  • open a real web browser
  • hide distracting UI elements
  • scroll down to load dynamic content
  • collect the latest tweets by a given Twitter user
  • create a file system directory by the name of the Twitter user
  • store the text content of each tweet in separate files in the directory
  • store a screenshot of each tweet in the directory

[Figure: tweet screenshot]

Because Twitter blocks requests coming from the cloud, this robot can only be executed on a local machine or triggered from Control Room using Robocorp Assistant.

Robot script explained

After a run, the robot should have created the directory output/tweets/RobocorpInc, containing images (screenshots of the tweets) and text files (the text of each tweet).

Tasks

The main robot file (tasks.robot) contains the tasks your robot is going to complete when run:

  • *** Tasks *** is the section header.
  • Store the latest tweets by given user name is the name of the task.
  • Open Twitter homepage, etc., are keyword calls.
  • [Teardown] Close Browser ensures that the browser will be closed even if the robot fails to accomplish its task.
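The teardown guarantee behaves much like a try/finally block. The following is a plain-Python analogy, not the robot's own code: FakeBrowser and the fail flag are hypothetical stand-ins used only to show that the cleanup step runs whether or not the task succeeds.

```python
class FakeBrowser:
    """Hypothetical stand-in for a real browser session."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False


def store_latest_tweets(browser, fail=False):
    """Run the task steps; always close the browser afterwards."""
    try:
        if fail:
            # Simulate a step that breaks halfway through the task.
            raise RuntimeError("could not load the page")
        return "done"
    finally:
        # Plays the role of `[Teardown]    Close Browser`.
        browser.close()
```

Calling store_latest_tweets(FakeBrowser(), fail=True) still raises the error, but the browser object ends up closed first.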

Settings

The *** Settings *** section provides short Documentation for the script and imports libraries (Library) that add new keywords for the robot to use. Libraries contain Python code for things like commanding a web browser (RPA.Browser.Selenium) or creating file system directories and files (RPA.FileSystem).

Variables

Variables provide a way to change the input values for the robot in one place. This robot provides variables for the Twitter user name, the number of tweets to collect, the file system directory path for storing the tweets, and a locator for finding the tweet HTML elements from the Twitter web page.

Keywords

The Keywords section defines the implementation of the actual things the robot will do.

The Open Twitter homepage keyword uses the Open Available Browser keyword from the RPA.Browser.Selenium library to open a browser. It takes one required argument: the URL to open, which here contains the ${USER_NAME} variable.

The ${USER_NAME} variable is defined in the *** Variables *** section.

The value of the variable is RobocorpInc. When the robot is executed, Robot Framework replaces the variable with its value, and the URL becomes https://mobile.twitter.com/RobocorpInc.
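The substitution step can be imitated in plain Python with string.Template, which happens to use the same ${NAME} placeholder syntax. This is only an analogy for how the final URL comes about; the URL pattern is inferred from the resulting URL quoted above.

```python
from string import Template

# Value taken from the text; the pattern is inferred from the resulting URL.
USER_NAME = "RobocorpInc"
url_template = Template("https://mobile.twitter.com/${USER_NAME}")

# Replace the placeholder with the variable's value.
url = url_template.substitute(USER_NAME=USER_NAME)
print(url)  # https://mobile.twitter.com/RobocorpInc
```

Changing USER_NAME in one place changes every URL built from the template, which is the point of keeping such values in variables.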

The Hide element keyword takes care of getting rid of unnecessary elements from the web page when taking screenshots. The Mute Run On Failure keyword from the RPA.RobotLogListener library prevents the robot from saving a screenshot in case of failure (the default behavior on failure) when executing the Execute Javascript keyword. In this case, we are not really interested in these failures, so we decided to mute the failure behavior.

The Execute Javascript keyword executes the given JavaScript in the browser. The JavaScript expression contains a variable (${locator}) that is passed in as an argument to the Hide element keyword.

The Hide distracting UI elements keyword calls the Hide element keyword with locators pointing to all the elements we want the robot to hide from the web page. A for loop is used to loop through the list of locators.
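The hide-and-ignore-failures pattern can be sketched in plain Python. The sketch rests on several assumptions: the locators are treated as CSS selectors without single quotes, the JavaScript shown is one plausible way to hide an element (not necessarily what the robot uses), and execute_javascript is a hypothetical callable standing in for the browser library.

```python
def build_hide_script(locator):
    """Return JavaScript that hides the first element matching a CSS selector."""
    return f"document.querySelector('{locator}').style.display = 'none'"


def hide_distracting_elements(locators, execute_javascript):
    """Run the hide script for every locator, ignoring individual failures."""
    hidden = []
    for locator in locators:
        try:
            execute_javascript(build_hide_script(locator))
            hidden.append(locator)
        except Exception:
            # A missing element is not worth failing the whole run for.
            continue
    return hidden
```

The loop continues past a locator that matches nothing, so one absent banner does not stop the remaining elements from being hidden.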

The Scroll down to load dynamic content keyword ensures that the dynamic content is loaded before the robot tries to store the tweets. It scrolls the browser window down in 200-pixel steps, from 200 pixels below the top of the page to 2000 pixels. The Sleep keyword gives the dynamic content some time to load. Finally, the web page is scrolled back to the top.
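The stepping logic can be sketched in plain Python as a list of JavaScript calls. This is an illustration under assumptions: the use of window.scrollTo and the function name are hypothetical, and a real run would also pause between steps. Note that Python's range, like Robot Framework's IN RANGE, excludes the end value, so whether the 2000-pixel position itself is reached depends on how the loop bound is written.

```python
def scroll_script_steps(start=200, stop=2000, step=200):
    """Build JavaScript calls for a stepwise scroll, then a jump back to the top."""
    # range() excludes `stop`, so the defaults yield 200, 400, ..., 1800.
    scripts = [f"window.scrollTo(0, {pixels})" for pixels in range(start, stop, step)]
    # Finish by returning to the top of the page.
    scripts.append("window.scrollTo(0, 0)")
    return scripts
```

Passing stop=2200 would include the 2000-pixel position as the last downward step.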

The Get tweets keyword collects the tweet HTML elements from the web page using the Get WebElements keyword. The Get Slice From List keyword is used to limit the number of elements before returning them using the RETURN keyword.
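Limiting the collected elements corresponds to an ordinary list slice. A minimal Python sketch, with a hypothetical function name and plain strings standing in for web elements:

```python
def get_tweets(elements, number_of_tweets):
    """Return at most `number_of_tweets` items from the front of the list."""
    return elements[:number_of_tweets]
```

If the page yields fewer elements than requested, the slice simply returns what is available rather than failing.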

The Store the tweets keyword stores the text and the screenshot of each tweet. It uses the Create Directory keyword to create a file system directory, the Set Variable keyword to create local variables, the Capture Element Screenshot keyword to take the screenshots, the Create File keyword to create the files, and the Evaluate keyword to evaluate a Python expression.
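The directory and file steps have close equivalents in Python's pathlib. The sketch below covers only the text files, writes into a temporary directory, and uses a hypothetical tweet-N.txt naming scheme; the robot's actual file names may differ.

```python
import tempfile
from pathlib import Path


def store_tweets(tweet_texts, directory):
    """Create the directory and write each tweet's text to its own file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # comparable to Create Directory
    written = []
    for index, text in enumerate(tweet_texts, start=1):
        text_file = directory / f"tweet-{index}.txt"  # hypothetical file name
        text_file.write_text(text, encoding="utf-8")  # comparable to Create File
        written.append(text_file)
    return written


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "tweets" / "RobocorpInc"
    files = store_tweets(["first tweet", "second tweet"], target)
    print([f.name for f in files])  # ['tweet-1.txt', 'tweet-2.txt']
```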

The file paths are constructed dynamically from variables.
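As a rough illustration of this kind of path construction in plain Python: base_dir plays the role of ${CURDIR}, the output/tweets/<user> layout follows the directory named earlier, and the tweet-N file stem is a hypothetical choice.

```python
from pathlib import PurePosixPath


def tweet_file_paths(base_dir, user_name, index):
    """Return the (text file, screenshot file) paths for one tweet."""
    tweet_dir = PurePosixPath(base_dir) / "output" / "tweets" / user_name
    stem = f"tweet-{index}"  # hypothetical naming scheme
    return tweet_dir / f"{stem}.txt", tweet_dir / f"{stem}.png"
```

For example, tweet_file_paths("/robot", "RobocorpInc", 1) gives /robot/output/tweets/RobocorpInc/tweet-1.txt and the matching .png path.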

Summary

Congratulations, you executed a web scraper robot!

During the process, you learned some concepts and features of the Robot Framework and some good practices:

  • Defining Settings for your script (*** Settings ***)
  • Documenting scripts (Documentation)
  • Importing libraries (Collections, RPA.Browser.Selenium, RPA.FileSystem, RPA.RobotLogListener)
  • Using keywords provided by libraries (Open Available Browser)
  • Creating your own keywords
  • Defining arguments ([Arguments])
  • Calling keywords with arguments
  • Returning values from keywords (RETURN)
  • Using predefined variables (${CURDIR})
  • Using your own variables
  • Creating loops with Robot Framework syntax
  • Running teardown steps ([Teardown])
  • Opening a real browser
  • Navigating to web pages
  • Locating web elements
  • Hiding web elements
  • Executing JavaScript code
  • Scraping text from web elements
  • Taking screenshots of web elements
  • Creating file system directories
  • Creating and writing to files
  • Ignoring errors when it makes sense (Run Keyword And Ignore Error)

Technical information

Last updated

1 November 2023

License

Apache License 2.0