Web scraper robot

This robot is included in our downloadable example robots. You can also find the code at the example robots repository.

When run, the robot will:

  • open a real web browser
  • accept the cookie notice if it appears
  • collect the latest tweets by a given Twitter user
  • create a file system directory by the name of the Twitter user
  • store the text content of each tweet in separate files in the directory
  • store a screenshot of each tweet in the directory

Tweet screenshot

Run this robot locally in Robocorp Lab

You can run this robot on your local machine using Robocorp Lab:

  1. Set up your development environment.
  2. Download the example robots.
  3. Open the twitter-web-scraper example.
  4. Open the tasks.robot file and run it.

Robot script

*** Settings ***
Documentation     Opens the Twitter web page and stores some content.
Library           RPA.Browser
Library           RPA.FileSystem

*** Variables ***
${NUMBER_OF_TWEETS}=    3
${TWITTER_URL}=    https://twitter.com
${USER_NAME}=     RobocorpInc

*** Keywords ***
Store the latest ${number_of_tweets} tweets by user name "${user_name}"
    Open Twitter homepage    ${user_name}
    Run Keyword And Ignore Error    Accept cookie notice
    Store tweets    ${user_name}    ${number_of_tweets}
    [Teardown]    Close Browser

*** Keywords ***
Open Twitter homepage
    [Arguments]    ${user_name}
    Open Available Browser    ${TWITTER_URL}/${user_name}

*** Keywords ***
Accept cookie notice
    ${cookie_acceptance_link}=    Get cookie acceptance link locator
    Click Element When Visible    ${cookie_acceptance_link}

*** Keywords ***
Store tweets
    [Arguments]    ${user_name}    ${number_of_tweets}
    ${tweets_locator}=    Get tweets locator    ${user_name}
    Wait Until Element Is Visible    ${tweets_locator}
    @{tweets}=    Get WebElements    ${tweets_locator}
    ${tweet_directory}=    Get tweet directory    ${user_name}
    Create Directory    ${tweet_directory}    parents=True
    ${index}=    Set Variable    1
    FOR    ${tweet}    IN    @{tweets}
        Exit For Loop If    ${index} > ${number_of_tweets}
        ${screenshot_file}=    Set Variable    ${tweet_directory}/tweet-${index}.png
        ${text_file}=    Set Variable    ${tweet_directory}/tweet-${index}.txt
        ${text}=    Set Variable    ${tweet.find_element_by_xpath(".//div[@lang='en']").text}
        Capture Element Screenshot    ${tweet}    ${screenshot_file}
        Create File    ${text_file}    ${text}    overwrite=True
        ${index}=    Evaluate    ${index} + 1
    END

*** Keywords ***
Get tweets locator
    [Arguments]    ${user_name}
    [Return]    xpath://article[descendant::span[contains(text(), "\@${user_name}")]]

*** Keywords ***
Get cookie acceptance link locator
    [Return]    xpath://span[contains(text(), "Close")]

*** Keywords ***
Get tweet directory
    [Arguments]    ${user_name}
    [Return]    ${CURDIR}${/}output${/}tweets${/}${user_name}

*** Tasks ***
Store the latest tweets by given user name
    Store the latest ${NUMBER_OF_TWEETS} tweets by user name "${USER_NAME}"

The robot should have created a directory output/tweets/RobocorpInc containing images (screenshots of the tweets) and text files (the texts of the tweets).

Robot script explained

The main robot file (.robot) contains the task(s) your robot is going to complete when run.

The Tasks section defines the tasks for the robot.

Store the latest tweets by given user name is the name of the task.

Store the latest ${NUMBER_OF_TWEETS} tweets by user name "${USER_NAME}" is a keyword call. The keyword is implemented in the Keywords section of this same robot file.

${NUMBER_OF_TWEETS} and ${USER_NAME} are references to variables defined in the Variables section.

Settings

*** Settings ***
Documentation     Opens the Twitter web page and stores some content.
Library           RPA.Browser
Library           RPA.FileSystem

The Settings section provides short documentation (Documentation) for the script and imports libraries (Library) that add new keywords for the robot to use.

Libraries typically contain Python code that accomplishes tasks, such as commanding a web browser (RPA.Browser) and creating file system directories and files (RPA.FileSystem).

Variables

*** Variables ***
${NUMBER_OF_TWEETS}=    3
${TWITTER_URL}=    https://twitter.com
${USER_NAME}=     RobocorpInc

Variables provide a way to change the input values for the robot in one place.

Keywords section

*** Keywords ***
Store the latest ${number_of_tweets} tweets by user name "${user_name}"
    Open Twitter homepage    ${user_name}
    Run Keyword And Ignore Error    Accept cookie notice
    Store tweets    ${user_name}    ${number_of_tweets}
    [Teardown]    Close Browser

The Keywords section defines the robot's keywords: the implementation of the concrete things the robot will do. Robocorp Lab's Notebook mode works best when each keyword is annotated with its own *** Keywords *** heading.

Store the latest ${number_of_tweets} tweets by user name "${user_name}" is a keyword that takes two arguments: ${number_of_tweets} and ${user_name}.

The keyword is called in the Tasks section, providing values for the arguments.

In this case, the default value for the number of tweets is 3, and the default value for the user name is RobocorpInc. With those values, the keyword implementation might look like this after Robot Framework has parsed the provided values:

*** Keywords ***
Store the latest ${number_of_tweets} tweets by user name "${user_name}"
    Open Twitter homepage    RobocorpInc
    Run Keyword And Ignore Error    Accept cookie notice
    Store tweets    RobocorpInc    3
    [Teardown]    Close Browser

Keywords can call other keywords.

Open Twitter homepage is another keyword. It takes one argument: ${user_name}.

Run Keyword And Ignore Error is a built-in keyword that takes another keyword as an argument (in this case, Accept cookie notice). Run Keyword And Ignore Error is useful here since the cookie notice might or might not appear, depending on browser cookie settings and other factors. If the cookie notice appears, we want the robot to accept it to get rid of the notice. If it does not appear, we are happy with that outcome too and continue with the script.
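In Python terms, Run Keyword And Ignore Error behaves roughly like a try/except block. The sketch below is a conceptual illustration, not RPA Framework code; the failing accept_cookie_notice function is a hypothetical stand-in for the real keyword:

```python
def accept_cookie_notice():
    # Hypothetical stand-in for the real keyword; it fails here,
    # simulating a page where the cookie notice never appeared.
    raise RuntimeError("Cookie notice not found")

def run_keyword_and_ignore_error(keyword):
    # Run the keyword and swallow any failure, mirroring Robot Framework's
    # ("PASS"/"FAIL", return value or error message) result tuple.
    try:
        result = keyword()
        return ("PASS", result)
    except Exception as error:
        return ("FAIL", str(error))

status, message = run_keyword_and_ignore_error(accept_cookie_notice)
```

Either way, execution continues after the call, which is exactly the behavior we want for the optional cookie notice.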

The Store tweets keyword takes two arguments: ${user_name} and ${number_of_tweets}.

[Teardown] tells Robot Framework to always run the given keyword (Close Browser) as the last step. The teardown runs even if the steps before it fail for any reason.

*** Keywords ***
Open Twitter homepage
    [Arguments]    ${user_name}
    Open Available Browser    ${TWITTER_URL}/${user_name}

Open Twitter homepage is one of your keywords. It is not provided by any external library. You can define as many keywords as you need. Your keywords can call other keywords, both your own and keywords provided by libraries.

    [Arguments]    ${user_name}

The [Arguments] line tells Robot Framework the names of the arguments this keyword expects. In this case, there is one argument: ${user_name}.

    Open Available Browser    ${TWITTER_URL}/${user_name}

Open Available Browser is a keyword provided by the RPA.Browser library. In this case, you call it with one argument: the URL (https://twitter.com/RobocorpInc).

The arguments here reference both a variable (${TWITTER_URL}, defined in the Variables section) and an argument (${user_name}, provided when calling your keyword).

*** Keywords ***
Accept cookie notice
    ${cookie_acceptance_link}=    Get cookie acceptance link locator
    Click Element When Visible    ${cookie_acceptance_link}

The Accept cookie notice keyword uses Click Element When Visible to wait until the cookie acceptance link becomes visible and then click it.

*** Keywords ***
Store tweets
    [Arguments]    ${user_name}    ${number_of_tweets}
    ${tweets_locator}=    Get tweets locator    ${user_name}
    Wait Until Element Is Visible    ${tweets_locator}
    @{tweets}=    Get WebElements    ${tweets_locator}
    ${tweet_directory}=    Get tweet directory    ${user_name}
    Create Directory    ${tweet_directory}    parents=True
    ${index}=    Set Variable    1
    FOR    ${tweet}    IN    @{tweets}
        Exit For Loop If    ${index} > ${number_of_tweets}
        ${screenshot_file}=    Set Variable    ${tweet_directory}/tweet-${index}.png
        ${text_file}=    Set Variable    ${tweet_directory}/tweet-${index}.txt
        ${text}=    Set Variable    ${tweet.find_element_by_xpath(".//div[@lang='en']").text}
        Capture Element Screenshot    ${tweet}    ${screenshot_file}
        Create File    ${text_file}    ${text}    overwrite=True
        ${index}=    Evaluate    ${index} + 1
    END

The Store tweets keyword contains the steps for collecting and storing a screenshot and the text of each tweet.

This keyword could also be provided by a library. Libraries are typically used when the implementation might be complex and would be difficult to implement using Robot Framework syntax. Using ready-made libraries is recommended to avoid unnecessary time spent on implementing your own solution if a ready-made solution exists.

RPA Framework provides many open-source libraries for typical RPA (Robotic Process Automation) use cases.

In this example, plain Robot Framework syntax is used to demonstrate the kind of "programming" logic that is possible with it.

More complex business logic is better implemented by taking advantage of libraries (such as RPA.Browser, RPA.FileSystem, or your own library).

[Arguments]    ${user_name}    ${number_of_tweets}

Store tweets takes two arguments: ${user_name} and ${number_of_tweets}.

${tweets_locator}=    Get tweets locator    ${user_name}

A locator (an instruction for the browser to find specific element(s)) is provided by the keyword Get tweets locator that takes one argument: ${user_name}. The computed locator is stored in a local variable ${tweets_locator}. Having the assignment symbol (=) is not required, but including it is a recommended convention for communicating the intent of the assignment.

Inspecting the DOM to find tweet elements

In this case the returned locator is an XPath expression (//article[descendant::span[contains(text(), "@RobocorpInc")]]) prefixed by SeleniumLibrary specific xpath: prefix.

Tip: You can test XPath expressions in Firefox and in Chrome. Right-click on a web page and select Inspect or Inspect Element to open up the developer tools. Select the Console tab. In the console, type $x('//div') and hit Enter. The console will display the matched elements (in this case, all the div elements). Experiment with your query until it works. You can use the query with SeleniumLibrary as an element locator by prefixing the query with xpath:.

Locating elements in browser console with XPath

You can learn more about how to find interface elements in web applications and the concept of locators in this dedicated article.

Wait Until Element Is Visible    ${tweets_locator}

Wait Until Element Is Visible is a keyword provided by the RPA.Browser library. It takes a locator as an argument and waits for the element to become visible or until the timeout expires (five seconds by default).

@{tweets}=    Get WebElements    ${tweets_locator}

Get WebElements keyword (RPA.Browser) is used to find and return elements matching the given locator argument (${tweets_locator}). The elements are stored in a local list variable, @{tweets}. List variables start with @ instead of $.

${tweet_directory}=    Get tweet directory    ${user_name}

Get tweet directory keyword returns a directory path based on the given ${user_name} argument. The path is stored in a local ${tweet_directory} variable.

Create Directory    ${tweet_directory}    parents=True

Create Directory keyword is provided by the RPA.FileSystem library. It creates a file system directory based on the given path argument (${tweet_directory}). The parents=True argument enables the creation of the parent directories if they do not exist.
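Conceptually, this maps to ordinary nested directory creation. A minimal Python sketch of the same idea, using the standard library (the example path under a temporary directory is hypothetical):

```python
import tempfile
from pathlib import Path

# Hypothetical base directory; a temporary directory keeps the sketch self-contained.
base = Path(tempfile.mkdtemp())
tweet_directory = base / "output" / "tweets" / "RobocorpInc"

# parents=True creates the missing intermediate directories (like the keyword's
# parents=True argument), and exist_ok=True keeps repeated runs from failing.
tweet_directory.mkdir(parents=True, exist_ok=True)
```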

${index}=    Set Variable    1

Set Variable keyword is used to assign raw values to variables. In this case, a local variable ${index} is created with the value of 1. This variable keeps track of the loop index in order to create unique names for the stored files.

    FOR    ${tweet}    IN    @{tweets}
        ...
    END

Robot Framework supports loops using the FOR syntax. The robot loops over the found tweet elements and executes a set of steps for each tweet.

Exit For Loop If    ${index} > ${number_of_tweets}

The Exit For Loop If keyword terminates the loop when the given condition evaluates to True. In this case, the loop terminates once the given number of tweets has been processed.

Capture Element Screenshot and Create File keywords are used to take a screenshot of each element and to create a text file containing the element text.

${index}=    Evaluate    ${index} + 1

The previously initialized ${index} variable is incremented by one at the end of each loop iteration using the Evaluate keyword. Evaluate takes a Python expression as an argument and returns its value.
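Put together, the loop above corresponds to plain Python like this. It is a conceptual sketch: the list of strings is a hypothetical stand-in for the web elements, and enumerate plus break replace the manual index handling:

```python
# Hypothetical tweet texts standing in for the scraped web elements.
tweets = ["First tweet", "Second tweet", "Third tweet", "Fourth tweet"]
number_of_tweets = 3
stored = {}

# enumerate(..., start=1) plays the role of the manually incremented ${index};
# break corresponds to Exit For Loop If.
for index, tweet in enumerate(tweets, start=1):
    if index > number_of_tweets:
        break
    stored[f"tweet-{index}.txt"] = tweet
```

Robot Framework's syntax has no direct enumerate equivalent, which is why the script maintains ${index} by hand with Set Variable and Evaluate.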

Get tweets locator
    [Arguments]    ${user_name}
    [Return]    xpath://article[descendant::span[contains(text(), "\@${user_name}")]]

Get tweets locator keyword returns an element locator based on the given ${user_name} argument. Robot Framework uses [Return] syntax for returning values.
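The same locator construction in Python is plain string interpolation; the variable reference ${user_name} inside the [Return] value behaves like an f-string placeholder. A sketch (the user name is just an example value):

```python
def get_tweets_locator(user_name: str) -> str:
    # Matches <article> elements containing a <span> whose text includes @user_name;
    # the xpath: prefix tells SeleniumLibrary how to interpret the locator.
    return f'xpath://article[descendant::span[contains(text(), "@{user_name}")]]'

locator = get_tweets_locator("RobocorpInc")
```

Note that the Robot Framework version escapes the at sign as \@ only to keep the parser from mistaking it for list-variable syntax; the character itself is an ordinary @ in the resulting XPath.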

Get tweet directory
    [Arguments]    ${user_name}
    [Return]    ${CURDIR}${/}output${/}tweets${/}${user_name}

The Get tweet directory keyword implementation uses one of the predefined variables in Robot Framework. ${CURDIR} contains the absolute path of the directory where the robot file is located. ${/} is the operating-system-specific path separator.
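In Python, the closest counterpart to joining path segments with ${/} is os.path.join, which inserts the platform's separator for you. A sketch, where base_dir is a hypothetical stand-in for ${CURDIR}:

```python
import os

def get_tweet_directory(base_dir: str, user_name: str) -> str:
    # os.path.join uses the platform-specific separator,
    # like ${/} in Robot Framework path expressions.
    return os.path.join(base_dir, "output", "tweets", user_name)

# "/robot" is an example base directory, not a real path from this robot.
tweet_directory = get_tweet_directory("/robot", "RobocorpInc")
```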

Summary

You executed a web scraper robot, congratulations!

During the process, you learned some concepts and features of the Robot Framework and some good practices:

  • Defining Settings for your script (*** Settings ***)
  • Documenting scripts (Documentation)
  • Importing libraries (RPA.Browser, RPA.FileSystem)
  • Using keywords provided by libraries (Open Available Browser)
  • Creating your own keywords
  • Defining arguments ([Arguments])
  • Calling keywords with arguments
  • Returning values from keywords ([Return])
  • Using predefined variables (${CURDIR})
  • Using your own variables
  • Creating loops with Robot Framework syntax
  • Running teardown steps ([Teardown])
  • Opening a real browser
  • Navigating to web pages
  • Locating web elements
  • Building and testing locators ($x('//div'))
  • Scraping text from web elements
  • Taking screenshots of web elements
  • Creating file system directories
  • Creating and writing to files
  • Ignoring errors when it makes sense (Run Keyword And Ignore Error with Accept cookie notice)