Python Library: pickle

how to store Python objects to file

30 Sep 2022

I was looking for a better way to save Python objects for tests, specifically when debugging a scraper where the issue lies in parsing the soup object, i.e. there is no need to scrape the site live on each run.

Quotes:

The serialization process is a way to convert a data structure into a linear form that can be stored or transmitted over a network.

Python offers three different modules in the standard library that allow you to serialize and deserialize objects:

  • the marshal module
  • the json module
  • the pickle module

Here are three general guidelines for deciding which approach to use:

  1. Don’t use the marshal module. It’s used mainly by the interpreter, and the official documentation warns that the Python maintainers may modify the format in backward-incompatible ways.

  2. The json module and XML are good choices if you need interoperability with different languages or a human-readable format.

  3. The Python pickle module is a better choice for all the remaining use cases. If you don’t need a human-readable format or a standard interoperable format, or if you need to serialize custom objects, then go with pickle.
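To make the second and third guidelines concrete, here is a minimal sketch of where json stops and pickle carries on. It uses built-in types that json rejects (a set and a datetime) rather than a custom class; the `record` data is made up for illustration. Custom class instances pickle the same way, provided the class can be imported when unpickling.

```python
import datetime
import json
import pickle

# Made-up sample data containing types json cannot serialize
record = {
    "seen": {"a", "b"},                        # set
    "when": datetime.datetime(2022, 9, 30),    # datetime
}

# json only knows dict, list, str, numbers, bool and None,
# so it raises TypeError on the set / datetime values
try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False

# pickle serializes the same object and restores an equal copy
restored = pickle.loads(pickle.dumps(record))
```

The trade-off is that the pickled bytes are neither human-readable nor usable from other languages.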

The Python pickle module basically consists of 4 methods:

Pickling:

  1. pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
  2. pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)

Unpickling:

  1. pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
  2. pickle.loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)

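s is for string historically, but in Python 3 it means an in-memory bytes object, i.e.:

  • dump writes the pickled bytes to an open binary file and dumps returns them as a bytes object.
  • load reads pickled bytes from an open binary file and loads reads them from a bytes object.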
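A minimal sketch of the four functions side by side. The `data` dict is made up, and `io.BytesIO` is only a stand-in for a real file opened in binary mode.

```python
import io
import pickle

data = {"title": "example", "tags": ["a", "b"], "count": 3}

# dumps returns the pickled object as bytes (not str)
blob = pickle.dumps(data)

# loads rebuilds the object from those bytes
from_bytes = pickle.loads(blob)

# dump writes the pickled bytes to a binary file-like object;
# io.BytesIO stands in for a file opened with mode "wb"
buffer = io.BytesIO()
pickle.dump(data, buffer)

# load reads from a binary file-like object (like mode "rb")
buffer.seek(0)
from_file = pickle.load(buffer)
```

Opening the file in text mode ("w" or "r") instead of binary mode raises a TypeError, since pickle reads and writes bytes.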

Usage

with open()

import pickle

### if RecursionError, uncomment the following and play with value
# import sys
# sys.setrecursionlimit(5000)

[...]

if test:
    pickle_dump = f"test/soup-{slug_from_url(url)}.pkl"
    with open(pickle_dump, "rb") as tp:
        print(f"\nTEST: LOADING FROM PICKLE FILE: {pickle_dump}\n")
        soup = pickle.load(tp)
        # pickle.load() returns a complete object, so the remaining logic
        # could equally run after the with block has closed the file
        output = process_soup(soup)
else:
    print(f"\nRUNNING LIVE SCRAPING OF {url}\n")
    chrome_options = ChromeOptions()
    chrome_options.add_argument('--headless')

    s = Service(driverpath)
    web = Chrome(service=s, options=chrome_options)
    web.get(url)
    xml = web.page_source
    web.quit()
    soup = BeautifulSoup(xml, features='html.parser')

    with open(f"test/soup-{slug_from_url(url)}.pkl", 'wb') as pf:
        pickle.dump(soup, pf)

or close at the end

easier sometimes for code structure:

if test:
    pickle_dump = f"test/soup-{slug_from_url(url)}.pkl"
    print(f"\nTEST: LOADING FROM PICKLE FILE: {pickle_dump}\n")
    soup_backup = open(pickle_dump, "rb")
    soup = pickle.load(soup_backup)
    # soup object can now be used after if/else statement
else:
    print(f"\nRUNNING LIVE SCRAPING OF {url}\n")
    chrome_options = ChromeOptions()
    chrome_options.add_argument('--headless')

    s = Service(driverpath)
    web = Chrome(service=s, options=chrome_options)
    web.get(url)
    xml = web.page_source
    web.quit()
    soup = BeautifulSoup(xml, features='html.parser')

    # this can stay as with open(): a quick file save with no further action required
    with open(f"test/soup-{slug_from_url(url)}.pkl", 'wb') as pf:
        pickle.dump(soup, pf)

# runs for both branches, after the if/else statement
output = process_soup(soup)

if test:
    # the pickle file handle (not the soup object) gets closed here
    soup_backup.close()
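A side note on both variants: pickle.load() reads the whole object before it returns, so the file does not have to stay open while the object is being used. A minimal sketch, assuming a plain dict as a stand-in for the soup object and a temporary directory instead of the test/ folder:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the soup object
original = {"links": ["/a", "/b"], "title": "Example page"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "soup-example.pkl")

    with open(path, "wb") as pf:
        pickle.dump(original, pf)

    with open(path, "rb") as tp:
        loaded = pickle.load(tp)

    # The with block has ended, so the file handle is closed
    file_closed = tp.closed

# The temporary file is deleted by now, yet loaded is still usable
title = loaded["title"]
```

So the with open() form can be used for loading too, with the processing placed after the block.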

how I use it now

05 Oct 2022

import pickle

refresh = True

### Path to pickle file
pickle_path = "test/xxxxx.pickle"

### Generate pickle file
if refresh:
    my_variable = xxxxxxx
    with open(pickle_path, 'wb') as pf:
        pickle.dump(my_variable, pf) 

if not refresh:
    pickle_dump = open(pickle_path, "rb")
    my_variable = pickle.load(pickle_dump)

### End Of File - Close pickle file
if not refresh:
    pickle_dump.close()  # close the file handle, not the unpickled object
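The same refresh / load pattern can be wrapped in a small function. This is only a sketch: `load_or_refresh` and its parameters are hypothetical names, and `build` stands for whatever produces the value to cache.

```python
import os
import pickle
import tempfile


def load_or_refresh(pickle_path, build, refresh):
    """Return build() and cache it when refresh is True, else load the cache.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if refresh:
        value = build()
        with open(pickle_path, "wb") as pf:
            pickle.dump(value, pf)
        return value
    with open(pickle_path, "rb") as pf:
        return pickle.load(pf)


# Usage with a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pickle")
    # First call builds the value and writes the pickle file
    first = load_or_refresh(path, lambda: [1, 2, 3], refresh=True)
    # Second call ignores build and reads the pickle file instead
    second = load_or_refresh(path, lambda: ["not", "called"], refresh=False)
```

Both file handles are closed by their with blocks, so no separate close step is needed at the end of the script.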

What tripped me up

Recursion Error

Getting the error RecursionError: maximum recursion depth exceeded while pickling an object...

Deeply nested objects, such as a large soup, can exceed the default recursion limit (usually 1000). Solved by raising the recursion limit:

import sys
sys.setrecursionlimit(5000)
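If changing the limit for the whole script feels too broad, one option is to raise it only around the pickling call and restore it afterwards. A minimal sketch: `recursion_limit` is a hypothetical helper, not part of sys or pickle, and the nested list is only a stand-in for a deep soup tree. The value 5000 is the one used above; very high values can crash the interpreter, so it is worth raising it in steps.

```python
import pickle
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, then restore the old value.

    Hypothetical helper, not part of pickle or sys.
    """
    previous = sys.getrecursionlimit()
    # Never lower the limit below what is already configured
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


# Build a moderately nested list as a stand-in for a deep soup tree
nested = []
for _ in range(200):
    nested = [nested]

before = sys.getrecursionlimit()
with recursion_limit(5000):
    inside = sys.getrecursionlimit()
    blob = pickle.dumps(nested)
after = sys.getrecursionlimit()
```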
