Python Library: Beautiful Soup

Beautiful Soup makes it easy to scrape information from web pages by parsing HTML into a tree of objects you can search and navigate.

Library resources
PyPI https://pypi.org/project/beautifulsoup4/
GitHub ---
Documentation https://beautiful-soup-4.readthedocs.io/en/latest/

Getting started

pip3 install beautifulsoup4

Usage

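from bs4 import BeautifulSoup

file = 'path/to/folder/file.html'

# Use a context manager so the file handle is closed after parsing
with open(file) as f:
    soup = BeautifulSoup(f, 'html.parser')

# Each book title is assumed to sit in an <h3> tag
books = soup.find_all('h3')

for book in books:

    title = book.text
    print(f"{title=}")

    # The first child of the <h3> is expected to be the <a> link
    first_child = book.contents[0]
    try:
        link = first_child['href']
    except (KeyError, TypeError):
        # No href attribute, or the first child is plain text
        link = ''
    print(f"{link=}")

    # Skip the whitespace node after the <h3> to reach the author tag
    author = book.next_sibling.next_sibling.text
    print(f"{author=}")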
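To try the loop without a file on disk, here is a minimal sketch that parses an inline string instead. The HTML layout, titles, links and author names are invented for illustration; the sketch assumes each `<h3>` is followed by a newline and then a tag holding the author, which is why `next_sibling` is called twice.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup: an <h3> title, then the author in the next tag
html = """<h3><a href="/books/1">First Book</a></h3>
<p>Author One</p>
<h3>Second Book</h3>
<p>Author Two</p>"""

soup = BeautifulSoup(html, "html.parser")

records = []
for book in soup.find_all("h3"):
    title = book.text

    # The first child is either an <a> tag or a plain string
    first_child = book.contents[0]
    link = first_child.get("href", "") if isinstance(first_child, Tag) else ""

    # First next_sibling is the newline string, the second is the <p> tag
    author = book.next_sibling.next_sibling.text

    records.append((title, link, author))

print(records)
```

Checking the node type with `isinstance` is one alternative to the try/except in the snippet above; either approach falls back to an empty link when the title is not wrapped in an anchor.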

See also:

Get all links in an HTML page
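One possible way to collect every link, as a minimal sketch: the sample HTML below is made up for illustration, and `href=True` limits the match to anchors that actually carry an `href` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, for illustration only
html = """
<html><body>
<a href="https://example.com/one">One</a>
<a href="/two">Two</a>
<a name="no-link">Anchor without href</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that have an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```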

Target by ID with wildcard

Useful when IDs or classes contain random strings, e.g.:
-<div id="content-author-B00CR42MOY" class="information_row">C. S. Lewis</div>
-<div id="content-author-B08CGP9TJ7" class="information_row">Ernest Cline</div>

Snippet:

author = soup.find("div", {"id": lambda value: value and value.startswith("content-author")}).text
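The same prefix match can also be written with a compiled regular expression or a CSS attribute selector. The sketch below reuses the two sample rows from above and adds one invented `content-title` row to show that non-matching IDs are skipped.

```python
import re
from bs4 import BeautifulSoup

# The first two rows come from the example above; the third is hypothetical
html = """
<div id="content-author-B00CR42MOY" class="information_row">C. S. Lewis</div>
<div id="content-author-B08CGP9TJ7" class="information_row">Ernest Cline</div>
<div id="content-title-B08CGP9TJ7" class="information_row">A title</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Regex filter: matches any id that starts with "content-author"
by_regex = [d.text for d in soup.find_all("div", id=re.compile(r"^content-author"))]

# CSS selector: [id^="..."] means "id starts with ..."
by_css = [d.text for d in soup.select('div[id^="content-author"]')]

print(by_regex)
print(by_css)
```

Both variants return every matching element, whereas `find` stops at the first one and returns None when nothing matches, so calling `.text` directly on its result can raise an AttributeError.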

Find the next element after a tag

soup.head.next_element.next_element
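A small sketch of what that chain returns, using an invented one-line document. `next_element` walks the tree in parse order, so the first hop lands on the first child of `<head>` and the second hop on the string inside it. If the markup has whitespace between tags, a hop may land on a whitespace string instead.

```python
from bs4 import BeautifulSoup

# Hypothetical document with no whitespace between tags
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# First hop: the <title> tag, the first thing parsed after <head>
first = soup.head.next_element

# Second hop: the text node inside <title>
second = soup.head.next_element.next_element

print(repr(first))
print(repr(second))
```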

Find an element based on a custom tag attribute

e.g. a data-field attribute (content is any parsed tag or soup object):

if content.find('a', {'data-field': 'experience_company_logo'}):
    pass  # handle the matching link here
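As a self-contained sketch of the same lookup: the markup, the `href` values and the `company_url` variable are assumptions made for illustration; only the `data-field` value comes from the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing a custom data-field attribute
html = """
<div class="experience">
  <a data-field="experience_company_logo" href="/company/acme">Acme</a>
  <a data-field="other" href="/other">Other</a>
</div>
"""

content = BeautifulSoup(html, "html.parser")

# Passing a dict avoids the problem that data-field is not a valid keyword name
logo_link = content.find("a", {"data-field": "experience_company_logo"})

# find returns None when nothing matches, so test before reading attributes
if logo_link:
    company_url = logo_link["href"]
else:
    company_url = ""

print(company_url)
```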
