Library resources | |
---|---|
PyPI | https://pypi.org/project/beautifulsoup4/ |
GitHub | --- |
Documentation | https://beautiful-soup-4.readthedocs.io/en/latest/ |
Getting started
pip3 install beautifulsoup4
Usage
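from bs4 import BeautifulSoup

file = 'path/to/folder/file.html'
# Parse the local HTML file with Python's built-in parser
with open(file) as f:
    soup = BeautifulSoup(f, 'html.parser')

books = soup.find_all('h3')
for book in books:
    title = book.text
    print(f"{title=}")
    # The first child of the <h3> is expected to be the link
    y = book.contents[0]
    try:
        link = y['href']
    except (KeyError, TypeError):
        # No href attribute, or the first child is plain text
        link = ''
    print(f"{link=}")
    # The first next_sibling is usually a whitespace text node, hence two hops
    author = book.next_sibling.next_sibling.text
    print(f"{author=}")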
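The loop above depends on a page layout that these notes do not show. Below is a minimal self-contained sketch, assuming each h3 optionally wraps a link and is followed by a p element holding the author. The sample HTML and the `extract_books` helper are hypothetical, and `find_next_sibling("p")` is swapped in for the double `next_sibling` hop because it does not depend on the whitespace between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each <h3> may wrap a link and is followed by the author.
SAMPLE_HTML = """
<div>
<h3><a href="/books/1">Out of the Silent Planet</a></h3>
<p>C. S. Lewis</p>
<h3>Ready Player One</h3>
<p>Ernest Cline</p>
</div>
"""


def extract_books(html):
    """Return a list of (title, link, author) tuples, one per <h3>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for book in soup.find_all("h3"):
        title = book.get_text(strip=True)
        # Look the link up by tag name instead of relying on contents[0]
        anchor = book.find("a")
        link = anchor.get("href", "") if anchor else ""
        # Jump straight to the next <p> sibling, skipping text nodes
        author_tag = book.find_next_sibling("p")
        author = author_tag.get_text(strip=True) if author_tag else ""
        rows.append((title, link, author))
    return rows


for row in extract_books(SAMPLE_HTML):
    print(row)
```

A heading without a link yields an empty string for the link rather than raising, which matches the intent of the try/except in the original loop.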
Target by ID with wildcard
Useful when IDs or classes contain random strings, e.g.:
- <div id="content-author-B00CR42MOY" class="information_row">C. S. Lewis</div>
- <div id="content-author-B08CGP9TJ7" class="information_row">Ernest Cline</div>
Snippet (x is any parsed tag, or the soup itself):
author = x.find("div", {"id": lambda value: value and value.startswith('content-author')}).text
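A runnable sketch of the prefix match, using the two example rows plus one hypothetical non-author row to show that it is filtered out. It also shows two alternatives that should give the same result: a compiled regular expression and a CSS attribute-prefix selector (`select` relies on the soupsieve package, which pip normally installs alongside beautifulsoup4).

```python
import re

from bs4 import BeautifulSoup

# The first two rows come from the example above; the third is hypothetical.
HTML = """
<div id="content-author-B00CR42MOY" class="information_row">C. S. Lewis</div>
<div id="content-author-B08CGP9TJ7" class="information_row">Ernest Cline</div>
<div id="content-title-B08CGP9TJ7" class="information_row">Ready Player One</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def has_author_prefix(value):
    # value is None for tags without an id, so guard before startswith
    return value is not None and value.startswith("content-author")


# 1. Function filter on the id attribute
by_function = [d.text for d in soup.find_all("div", id=has_author_prefix)]

# 2. Regular expression anchored at the start of the id
by_regex = [d.text for d in soup.find_all("div", id=re.compile(r"^content-author"))]

# 3. CSS "starts with" attribute selector
by_css = [d.text for d in soup.select('div[id^="content-author"]')]

print(by_function)
```

The function form is the most flexible; the CSS form is the shortest when a plain prefix is all that is needed.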
Find the next element after a tag
soup.head.next_element.next_element
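A small sketch of what the chained call returns, assuming a hypothetical one-line document with no whitespace between tags. `next_element` follows parse order and so descends into children, while `next_sibling` stays on the same level of the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical document, written without whitespace between tags.
HTML = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(HTML, "html.parser")

first = soup.head.next_element    # the <title> tag, first thing parsed after <head>
second = first.next_element       # the text node inside <title>
sibling = soup.head.next_sibling  # the <body> tag, on the same level as <head>

print(first.name, repr(str(second)), sibling.name)
```

With real pages, whitespace between tags shows up as extra text nodes, so the number of hops needed may differ from this sketch.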
Find an element based on a custom tag attribute
e.g. the data-field attribute:
if content.find('a', {'data-field': 'experience_company_logo'}):
    ...  # the element exists; handle it here
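A self-contained sketch of the same lookup, assuming hypothetical markup with two links that differ only in their data-field value. The attribute goes in a dictionary because `data-field` contains a hyphen and so cannot be written as a Python keyword argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the data-field values mirror the snippet above.
HTML = """
<div class="experience">
  <a data-field="experience_company_logo" href="/company/acme">Acme</a>
  <a data-field="experience_title" href="/jobs/1">Engineer</a>
</div>
"""
content = BeautifulSoup(HTML, "html.parser")

# Dictionary form: works for any attribute name, including hyphenated ones
logo = content.find("a", attrs={"data-field": "experience_company_logo"})
href = logo["href"] if logo else ""

# Equivalent CSS attribute selector
logo_css = content.select_one('a[data-field="experience_company_logo"]')

print(href)
```

`find` returns None when nothing matches, which is why the original snippet can use the call directly as an `if` condition.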