Beautiful Soup is a Python library specialized for pulling data out of HTML and XML files.

To some extent, it can also be used to modify the HTML.
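
As a minimal sketch of what modifying HTML can look like, the snippet below changes an attribute, replaces a tag's text, and inserts a new tag. The HTML fragment, the example.com URLs, and the variable names are made up for illustration. It uses `find`, `new_tag`, and `insert_after` from bs4.

```python
from bs4 import BeautifulSoup

# Small illustrative fragment (not a full page)
html = '<p class="story">My sisters: <a href="" class="sister" id="link1">Maria</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Locate the existing link and change it in place
link = soup.find("a", id="link1")
link["href"] = "https://example.com/maria"  # hypothetical URL
link.string = "Maria (updated)"             # replaces the tag's text

# Build a new tag and insert it right after the existing link
new_link = soup.new_tag(
    "a",
    attrs={"href": "https://example.com/diana", "class": "sister", "id": "link2"},
)
new_link.string = "Diana"
link.insert_after(new_link)

# Serialize the modified tree back to an HTML string
modified = str(soup)
print(modified)
```

The changes apply to the parsed tree in memory. Converting the soup back to a string with `str(soup)` gives the modified markup.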


from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
<head>
  <title>My page</title>
</head>
<body>
  <p class="title"><b>My page</b></p>
  <p class="story">My sisters:
    <a href="" class="sister" id="link1">Maria</a>,
    <a href="" class="sister" id="link2">Diana</a> and
  </p>
</body>
</html>
"""

# parse the document with Python's built-in HTML parser
soup = BeautifulSoup(html, 'html.parser')

# find all links in the HTML and print their href attributes
for link in soup.find_all('a'):
    print(link.get('href'))
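
Besides `href` attributes, the same calls can pull out text and other attributes. The sketch below uses a trimmed copy of the sample page. The names `page_title`, `sisters`, and `first_link` are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample page, for illustration only
html = """
<html><head><title>My page</title></head>
<body>
<p class="story">My sisters:
<a href="" class="sister" id="link1">Maria</a>,
<a href="" class="sister" id="link2">Diana</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Text of the <title> tag
page_title = soup.title.string

# Map each link's id to its visible text, filtering by CSS class
sisters = {a["id"]: a.get_text() for a in soup.find_all("a", class_="sister")}

# CSS selectors are also supported via select() / select_one()
first_link = soup.select_one("p.story a#link1")

print(page_title)
print(sisters)
print(first_link.get_text())
```

Note that `class_` has a trailing underscore because `class` is a reserved word in Python.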

AI assistants and LLMs are generally quite good at writing beautifulsoup4 code.
