30 Sep 2022
I was looking for a better way to save Python objects for tests. Specifically, when debugging a scraper where the issue lies in parsing the soup object, there is no need to scrape the site live on each run.
Quotes:
The serialization process is a way to convert a data structure into a linear form that can be stored or transmitted over a network.
Python offers three different modules in the standard library that allow you to serialize and deserialize objects:
- the `marshal` module
- the `json` module
- the `pickle` module
Here are three general guidelines for deciding which approach to use:
- Don’t use the `marshal` module. It’s used mainly by the interpreter, and the official documentation warns that the Python maintainers may modify the format in backward-incompatible ways.
- The `json` module and XML are good choices if you need interoperability with different languages or a human-readable format.
- The Python `pickle` module is a better choice for all the remaining use cases. If you don’t need a human-readable format or a standard interoperable format, or if you need to serialize custom objects, then go with pickle (see the sketch after this list).
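To make the last guideline concrete, a minimal sketch (the `Page` class is just an illustrative stand-in, not from my scraper): `json.dumps` rejects arbitrary custom objects, while `pickle` round-trips them.

import json
import pickle

class Page:
    """Toy custom object standing in for something like a parsed page."""
    def __init__(self, url, title):
        self.url = url
        self.title = title

page = Page("https://example.com", "Example")

try:
    json.dumps(page)
except TypeError as e:
    print(f"json can't serialize it: {e}")

# pickle handles the custom object out of the box
restored = pickle.loads(pickle.dumps(page))
print(restored.url, restored.title)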
The Python pickle module basically consists of 4 methods:
Pickling:
pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
Unpickling:
pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
pickle.loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
`s` is for string, ie: `dump` writes bytes to a file and `dumps` dumps to a bytes string; `load` reads bytes from a file and `loads` loads from a bytes string.
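A quick round trip with all four (the file name is arbitrary):

import pickle

data = {"url": "https://example.com", "tags": ["a", "b"]}

# dump/load work on an open file in bytes mode
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)
with open("data.pkl", "rb") as f:
    assert pickle.load(f) == data

# dumps/loads work on an in-memory bytes string
blob = pickle.dumps(data)
assert pickle.loads(blob) == data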
Usage
with open()
import pickle

### if RecursionError, uncomment the following and play with the value
# import sys
# sys.setrecursionlimit(5000)

### assuming the usual scraping imports, e.g.:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.chrome.service import Service

[...]

if test:
    pickle_dump = f"test/soup-{slug_from_url(url)}.pkl"
    with open(pickle_dump, "rb") as tp:
        print(f"\nTEST: LOADING FROM PICKLE FILE: {pickle_dump}\n")
        soup = pickle.load(tp)
        # soup is fully loaded at this point; here the remaining
        # code logic stays inside the with block
        output = process_soup(soup)
else:
    print(f"\nRUNNING LIVE SCRAPING OF {url}\n")
    chrome_options = ChromeOptions()
    chrome_options.add_argument('--headless')
    s = Service(driverpath)
    web = Chrome(service=s, options=chrome_options)
    web.get(url)
    xml = web.page_source
    web.quit()
    soup = BeautifulSoup(xml, features='html.parser')
    with open(f"test/soup-{slug_from_url(url)}.pkl", 'wb') as pf:
        pickle.dump(soup, pf)
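The snippets above assume a `slug_from_url` helper (and `process_soup`, my own parsing logic, not shown). A minimal sketch of the helper could look like:

import re
from urllib.parse import urlparse

def slug_from_url(url):
    """Turn a URL into a filesystem-safe slug,
    e.g. 'https://example.com/a/b?x=1' -> 'example-com-a-b'."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    return re.sub(r"[^a-zA-Z0-9]+", "-", raw).strip("-").lower()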
or close at the end
Sometimes easier for code structure: open the file without a context manager and close it explicitly at the end:
if test:
    pickle_dump = f"test/soup-{slug_from_url(url)}.pkl"
    print(f"\nTEST: LOADING FROM PICKLE FILE: {pickle_dump}\n")
    soup_backup = open(pickle_dump, "rb")
    soup = pickle.load(soup_backup)
    # soup object can now be used after the if/else statement
else:
    print(f"\nRUNNING LIVE SCRAPING OF {url}\n")
    chrome_options = ChromeOptions()
    chrome_options.add_argument('--headless')
    s = Service(driverpath)
    web = Chrome(service=s, options=chrome_options)
    web.get(url)
    xml = web.page_source
    web.quit()
    soup = BeautifulSoup(xml, features='html.parser')
    # this can stay as with open(): a quick file save with no further action required
    with open(f"test/soup-{slug_from_url(url)}.pkl", 'wb') as pf:
        pickle.dump(soup, pf)

output = process_soup(soup)

if test:
    # close the pickle file handle here (the loaded soup itself stays usable)
    soup_backup.close()
How I use it now
05 Oct 2022
import pickle

refresh = True

### Path to pickle file
pickle_path = "test/xxxxx.pickle"

### Generate pickle file
if refresh:
    my_variable = xxxxxxx
    with open(pickle_path, 'wb') as pf:
        pickle.dump(my_variable, pf)

if not refresh:
    pickle_dump = open(pickle_path, "rb")
    my_variable = pickle.load(pickle_dump)

### End Of File - Close pickle file (the file handle, not the loaded variable)
if not refresh:
    pickle_dump.close()
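The same pattern can be wrapped in a small helper (my own refactor sketch; `compute` is whatever produces the value):

import pickle

def load_or_refresh(pickle_path, compute, refresh=False):
    """Run compute() and cache the result to pickle_path when refresh is True,
    otherwise load the previously pickled value."""
    if refresh:
        value = compute()
        with open(pickle_path, "wb") as pf:
            pickle.dump(value, pf)
        return value
    with open(pickle_path, "rb") as pf:
        return pickle.load(pf)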
What tripped me up
Recursion Error
Getting the error RecursionError: maximum recursion depth exceeded while pickling an object
...
Solved by increasing the recursion limit:
import sys
sys.setrecursionlimit(5000)
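Since the recursion limit is process-wide, one option is to raise it only around the dump; a small sketch, assuming the `soup` object from above (5000 is the same guess as earlier):

import pickle
import sys
from contextlib import contextmanager

@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, then restore the old one."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(old)

with recursion_limit(5000):
    with open("test/soup.pkl", "wb") as pf:
        pickle.dump(soup, pf)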