Web scraper robot

This robot is included in our downloadable example robots. You can also find the code at the example robots repository.

When run, the robot will:

  • open a real web browser
  • accept the cookie notice if it appears
  • hide distracting UI elements
  • scroll down to load dynamic content
  • collect the latest tweets by a given Twitter user
  • create a file system directory named after the Twitter user
  • store the text content of each tweet in separate files in the directory
  • store a screenshot of each tweet in the directory

Tweet screenshot

Because Twitter blocks requests coming from the cloud, this robot can only be executed on a local machine: either run directly in Robocorp Lab, or triggered from the cloud with Robocorp App.

Run this robot locally in Robocorp Lab

You can run this robot on your local machine using Robocorp Lab:

  1. Set up your development environment.
  2. Download the example robots.
  3. Open the twitter-web-scraper example.
  4. Open the tasks.robot file and run it.

Robot script

# ## Twitter web scraper example
# Opens the Twitter web page and stores some content.

*** Settings ***
Documentation     Opens the Twitter web page and stores some content.
Library           Collections
Library           RPA.Browser
Library           RPA.FileSystem
Library           RPA.RobotLogListener

*** Variables ***
${USER_NAME}=     RobocorpInc
${NUMBER_OF_TWEETS}=    3
${TWEET_DIRECTORY}=    ${CURDIR}${/}output${/}tweets/${USER_NAME}
${TWEETS_LOCATOR}=    xpath://article[descendant::span[contains(text(), "\@${USER_NAME}")]]

*** Keywords ***
Open Twitter homepage
    Open Available Browser    https://mobile.twitter.com/${USER_NAME}

*** Keywords ***
Accept the cookie notice
    Run Keyword And Ignore Error
    ...    Click Element When Visible
    ...    xpath://span[contains(text(), "Close")]

*** Keywords ***
Hide element
    [Arguments]    ${locator}
    Mute Run On Failure    Execute Javascript
    Run Keyword And Ignore Error
    ...    Execute Javascript
    ...    document.querySelector('${locator}').style.display = 'none'

*** Keywords ***
Hide distracting UI elements
    @{locators}=    Create List
    ...    header
    ...    \#layers > div
    ...    nav
    ...    div[data-testid="primaryColumn"] > div > div
    ...    div[data-testid="sidebarColumn"]
    FOR    ${locator}    IN    @{locators}
        Hide element    ${locator}
    END

*** Keywords ***
Scroll down to load dynamic content
    FOR    ${pixels}    IN RANGE    200    2000    200
        Execute Javascript    window.scrollBy(0, ${pixels})
        Sleep    500ms
    END
    Execute Javascript    window.scrollTo(0, 0)
    Sleep    500ms

*** Keywords ***
Get tweets
    Wait Until Element Is Visible    ${TWEETS_LOCATOR}
    @{all_tweets}=    Get WebElements    ${TWEETS_LOCATOR}
    @{tweets}=    Get Slice From List    ${all_tweets}    0    ${NUMBER_OF_TWEETS}
    [Return]    @{tweets}

*** Keywords ***
Store the tweets
    Create Directory    ${TWEET_DIRECTORY}    parents=True
    ${index}=    Set Variable    1
    @{tweets}=    Get tweets
    FOR    ${tweet}    IN    @{tweets}
        ${screenshot_file}=    Set Variable    ${TWEET_DIRECTORY}/tweet-${index}.png
        ${text_file}=    Set Variable    ${TWEET_DIRECTORY}/tweet-${index}.txt
        ${text}=    Set Variable    ${tweet.find_element_by_xpath(".//div[@lang='en']").text}
        Screenshot    ${tweet}    ${screenshot_file}
        Create File    ${text_file}    ${text}    overwrite=True
        ${index}=    Evaluate    ${index} + 1
    END

*** Tasks ***
Store the latest tweets by given user name
    Open Twitter homepage
    Accept the cookie notice
    Hide distracting UI elements
    Scroll down to load dynamic content
    Store the tweets
    [Teardown]    Close Browser

The robot should have created a directory output/tweets/RobocorpInc containing images (screenshots of the tweets) and text files (the texts of the tweets).

Robot script explained

Tasks

The main robot file (tasks.robot) contains the tasks your robot is going to complete when run:

*** Tasks ***
Store the latest tweets by given user name
    Open Twitter homepage
    Accept the cookie notice
    Hide distracting UI elements
    Scroll down to load dynamic content
    Store the tweets
    [Teardown]    Close Browser
  • *** Tasks *** is the section header.
  • Store the latest tweets by given user name is the name of the task.
  • Open Twitter homepage, Accept the cookie notice, etc., are keyword calls.
  • [Teardown] Close Browser ensures that the browser will be closed even if the robot fails to accomplish its task.

Settings

*** Settings ***
Documentation     Opens the Twitter web page and stores some content.
Library           Collections
Library           RPA.Browser
Library           RPA.FileSystem
Library           RPA.RobotLogListener

The *** Settings *** section provides short Documentation for the script and imports libraries (Library) that add new keywords for the robot to use. Libraries contain Python code for things like commanding a web browser (RPA.Browser) or creating file system directories and files (RPA.FileSystem).

Variables

*** Variables ***
${USER_NAME}=     RobocorpInc
${NUMBER_OF_TWEETS}=    3
${TWEET_DIRECTORY}=    ${CURDIR}${/}output${/}tweets/${USER_NAME}
${TWEETS_LOCATOR}=    xpath://article[descendant::span[contains(text(), "\@${USER_NAME}")]]

Variables provide a way to change the input values for the robot in one place. This robot provides variables for the Twitter user name, the number of tweets to collect, the file system directory path for storing the tweets, and a locator for finding the tweet HTML elements from the Twitter web page.

Keywords

The Keywords section defines the implementation of the actual things the robot will do. Robocorp Lab's Notebook mode works best when each keyword is annotated with the *** Keywords *** heading.

*** Keywords ***
Open Twitter homepage
    Open Available Browser    https://mobile.twitter.com/${USER_NAME}

The Open Twitter homepage keyword uses the Open Available Browser keyword from the RPA.Browser library to open a browser. It takes one required argument, the URL to open:

https://mobile.twitter.com/${USER_NAME}

The ${USER_NAME} variable is defined in the *** Variables *** section:

${USER_NAME}=     RobocorpInc

The value of the variable is RobocorpInc. When the robot is executed, Robot Framework replaces the variable with its value, and the URL becomes https://mobile.twitter.com/RobocorpInc.
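In Python terms, this substitution works like string formatting (a rough sketch of the concept, not Robot Framework's actual implementation):

```python
# ${USER_NAME} is replaced with its value when the robot runs.
user_name = "RobocorpInc"
url = f"https://mobile.twitter.com/{user_name}"
print(url)  # https://mobile.twitter.com/RobocorpInc
```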

*** Keywords ***
Accept the cookie notice
    Run Keyword And Ignore Error
    ...    Click Element When Visible
    ...    xpath://span[contains(text(), "Close")]

The Accept the cookie notice keyword calls the Click Element When Visible keyword, which waits for the Twitter cookie notice to appear and then clicks its Close element. If the cookie notice does not appear for any reason, the Run Keyword And Ignore Error keyword tells the robot to ignore the error and carry on. The ... syntax splits keyword arguments onto separate lines to keep the robot code shorter and more readable.
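Conceptually, Run Keyword And Ignore Error works like a Python try/except: the wrapped step may fail, but execution continues. A rough sketch (the clicking function below is a hypothetical stand-in for the RPA.Browser keyword):

```python
def click_element_when_visible(locator):
    # Hypothetical stand-in for the RPA.Browser keyword; here it
    # simulates the case where the cookie notice never appears.
    raise AssertionError(f"Element not visible: {locator}")

def run_and_ignore_error(keyword, *args):
    # Mimics Run Keyword And Ignore Error: returns ("PASS", value)
    # or ("FAIL", error message) instead of stopping the run.
    try:
        return "PASS", keyword(*args)
    except Exception as error:
        return "FAIL", str(error)

status, message = run_and_ignore_error(
    click_element_when_visible,
    'xpath://span[contains(text(), "Close")]',
)
print(status)  # FAIL, but execution carries on
```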

*** Keywords ***
Hide element
    [Arguments]    ${locator}
    Mute Run On Failure    Execute Javascript
    Run Keyword And Ignore Error
    ...    Execute Javascript
    ...    document.querySelector('${locator}').style.display = 'none'

The Hide element keyword removes unnecessary elements from the web page before screenshots are taken. The Mute Run On Failure keyword from the RPA.RobotLogListener library prevents the robot from saving a failure screenshot (the default behavior on failure) when the Execute Javascript keyword fails. Since these failures are harmless here, the failure behavior is muted.

The Execute Javascript keyword executes the given JavaScript in the browser. The JavaScript expression contains a variable (${locator}) that is passed in as an argument for the Hide element keyword.

*** Keywords ***
Hide distracting UI elements
    @{locators}=    Create List
    ...    header
    ...    \#layers > div
    ...    nav
    ...    div[data-testid="primaryColumn"] > div > div
    ...    div[data-testid="sidebarColumn"]
    FOR    ${locator}    IN    @{locators}
        Hide element    ${locator}
    END

The Hide distracting UI elements keyword calls the Hide element keyword with locators pointing to all the elements we want the robot to hide from the web page. A FOR loop iterates through the list of locators.

*** Keywords ***
Scroll down to load dynamic content
    FOR    ${pixels}    IN RANGE    200    2000    200
        Execute Javascript    window.scrollBy(0, ${pixels})
        Sleep    500ms
    END
    Execute Javascript    window.scrollTo(0, 0)
    Sleep    500ms

The Scroll down to load dynamic content keyword ensures that the dynamic content is loaded before the robot tries to store the tweets. The loop generates offsets from 200 to 1800 pixels in 200-pixel steps, and each iteration scrolls the window down by that amount relative to the current position (window.scrollBy is relative, so the position accumulates). The Sleep keyword gives the dynamic content time to load after each scroll. Finally, the web page is scrolled back to the top.
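The IN RANGE loop produces the same offsets as Python's range, and because window.scrollBy is relative, the scroll position accumulates. A quick sketch of the numbers involved:

```python
# Offsets generated by: FOR ${pixels} IN RANGE 200 2000 200
offsets = list(range(200, 2000, 200))
print(offsets)  # [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800]

# window.scrollBy scrolls relative to the current position,
# so the page position is the running sum of the offsets:
position = 0
for pixels in offsets:
    position += pixels
print(position)  # 9000 -> final scroll depth in pixels
```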

*** Keywords ***
Get tweets
    Wait Until Element Is Visible    ${TWEETS_LOCATOR}
    @{all_tweets}=    Get WebElements    ${TWEETS_LOCATOR}
    @{tweets}=    Get Slice From List    ${all_tweets}    0    ${NUMBER_OF_TWEETS}
    [Return]    @{tweets}

The Get tweets keyword collects the tweet HTML elements from the web page using the Get WebElements keyword. The Get Slice From List keyword is used to limit the number of elements before returning them using the [Return] keyword.
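Get Slice From List (from the Collections library) behaves like Python list slicing; a small sketch with made-up data:

```python
# Limiting collected elements, like:
# Get Slice From List    ${all_tweets}    0    ${NUMBER_OF_TWEETS}
NUMBER_OF_TWEETS = 3
all_tweets = ["tweet-a", "tweet-b", "tweet-c", "tweet-d", "tweet-e"]
tweets = all_tweets[0:NUMBER_OF_TWEETS]
print(tweets)  # ['tweet-a', 'tweet-b', 'tweet-c']
```

Note that slicing never raises an error: if the page yields fewer tweets than ${NUMBER_OF_TWEETS}, the slice simply returns the elements that exist.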

*** Keywords ***
Store the tweets
    Create Directory    ${TWEET_DIRECTORY}    parents=True
    ${index}=    Set Variable    1
    @{tweets}=    Get tweets
    FOR    ${tweet}    IN    @{tweets}
        ${screenshot_file}=    Set Variable    ${TWEET_DIRECTORY}/tweet-${index}.png
        ${text_file}=    Set Variable    ${TWEET_DIRECTORY}/tweet-${index}.txt
        ${text}=    Set Variable    ${tweet.find_element_by_xpath(".//div[@lang='en']").text}
        Screenshot    ${tweet}    ${screenshot_file}
        Create File    ${text_file}    ${text}    overwrite=True
        ${index}=    Evaluate    ${index} + 1
    END

The Store the tweets keyword stores the text and the screenshot of each tweet. It uses the Create Directory keyword to create a file system directory, the Set Variable keyword to create local variables, the Screenshot keyword to take the screenshots, the Create File keyword to create the files, and the Evaluate keyword to evaluate a Python expression.

The file paths are constructed dynamically using variables:

${TWEET_DIRECTORY}/tweet-${index}.png
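In Python terms, the counter and path construction look like this (the directory value and tweet texts are made up for illustration):

```python
tweet_directory = "output/tweets/RobocorpInc"
tweets = ["First tweet", "Second tweet", "Third tweet"]

# enumerate(..., start=1) plays the role of the ${index} counter
# that the robot advances with the Evaluate keyword.
for index, text in enumerate(tweets, start=1):
    screenshot_file = f"{tweet_directory}/tweet-{index}.png"
    text_file = f"{tweet_directory}/tweet-{index}.txt"
    print(text_file)  # output/tweets/RobocorpInc/tweet-1.txt, ...
```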

Summary

You executed a web scraper robot, congratulations!

During the process, you learned some concepts and features of Robot Framework and some good practices:

  • Defining Settings for your script (*** Settings ***)
  • Documenting scripts (Documentation)
  • Importing libraries (Collections, RPA.Browser, RPA.FileSystem, RPA.RobotLogListener)
  • Using keywords provided by libraries (Open Available Browser)
  • Creating your own keywords
  • Defining arguments ([Arguments])
  • Calling keywords with arguments
  • Returning values from keywords ([Return])
  • Using predefined variables (${CURDIR})
  • Using your own variables
  • Creating loops with Robot Framework syntax
  • Running teardown steps ([Teardown])
  • Opening a real browser
  • Navigating to web pages
  • Locating web elements
  • Hiding web elements
  • Executing JavaScript code
  • Scraping text from web elements
  • Taking screenshots of web elements
  • Creating file system directories
  • Creating and writing to files
  • Ignoring errors when it makes sense (Run Keyword And Ignore Error)