Python library: spaCy

Advanced Natural Language Processing in Python

04 Nov 2022

Library resources
PyPI ---
GitHub https://github.com/explosion/spaCy
Homepage https://spacy.io/
Documentation https://spacy.io/usage/spacy-101

Getting Started

pip install -U pip setuptools wheel
pip install -U 'spacy[apple]'
python -m spacy download en_core_web_trf

Usage

Extracting Entities (company & location) from HTML page:

import spacy
from bs4 import BeautifulSoup

file = 'xxxx.html'

# Parse the HTML and keep only its text content
with open(file) as f:
    soup = BeautifulSoup(f, 'html.parser')

text = soup.get_text()

# Transformer pipeline downloaded in Getting Started
nlp = spacy.load("en_core_web_trf")
doc = nlp(text)

print("\nents:")
for ent in doc.ents:
    # Every entity found, with its label
    print(ent.text, ent.label_)
    # Company candidates
    if ent.label_ in ('ORG', 'PRODUCT'):
        print(f"{ent.text=}")
        print(f"{ent.label_=}")
    # Locations (countries, cities, states)
    if ent.label_ == 'GPE':
        print(f"{ent.text=}")
        print(f"{ent.label_=}")
        print()
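
The loop above prints every mention, so a company named ten times on a page shows up ten times. A minimal sketch of one way to collapse the entities into unique company and location lists follows. The helper name `group_entities` and the label-to-bucket mapping are hypothetical, not part of spaCy, and the helper takes plain (text, label) pairs so it can be tried without loading a model.

```python
# Assumed mapping from spaCy labels to the two buckets of interest.
BUCKETS = {"ORG": "company", "PRODUCT": "company", "GPE": "location"}


def group_entities(pairs):
    """Group (text, label) pairs into de-duplicated company/location lists."""
    grouped = {"company": [], "location": []}
    seen = set()
    for text, label in pairs:
        bucket = BUCKETS.get(label)
        # Collapse newlines and repeated spaces left over from the HTML text.
        cleaned = " ".join(text.split())
        key = (bucket, cleaned.lower())
        if bucket is None or not cleaned or key in seen:
            continue
        seen.add(key)
        grouped[bucket].append(cleaned)
    return grouped


# Hand-written sample pairs standing in for real pipeline output.
sample = [("Explosion", "ORG"), ("Berlin", "GPE"), ("explosion", "ORG"), ("2022", "DATE")]
result = group_entities(sample)
print(result)
```

With a real document the call would be `group_entities((ent.text, ent.label_) for ent in doc.ents)`.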

Training own model

30 Nov 2022

The standard pretrained models do not capture job titles or phone numbers.

Looking into training a model for that.
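
Until a trained model exists, phone numbers are regular enough that a rule-based pass may work as a stopgap. The sketch below is that swapped-in technique, not model training: a regular expression finds candidate numbers, and spaCy's `doc.char_span` turns the matches into entities. The pattern, the `PHONE` label and the helper names are assumptions for illustration. The regex is deliberately loose (it would also accept other long digit runs, such as a numeric timestamp) and would need tuning on real pages. Job titles are far less regular and are not covered here.

```python
import re

# Assumed pattern: optional "+", then digits with spaces, dots or hyphens between them.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s.-]{6,}\d(?!\w)")


def find_phone_spans(text):
    """Return (start, end) character offsets of phone-like substrings."""
    spans = []
    for m in PHONE_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if 7 <= len(digits) <= 15:  # rough sanity check on the digit count
            spans.append((m.start(), m.end()))
    return spans


def add_phone_entities(doc, label="PHONE"):
    """Attach phone matches to a spaCy Doc, skipping overlaps with existing entities."""
    taken = {i for ent in doc.ents for i in range(ent.start, ent.end)}
    new_ents = []
    for start, end in find_phone_spans(doc.text):
        # "expand" snaps the character offsets outward to token boundaries.
        span = doc.char_span(start, end, label=label, alignment_mode="expand")
        if span is not None and not any(i in taken for i in range(span.start, span.end)):
            new_ents.append(span)
            taken.update(range(span.start, span.end))
    doc.ents = list(doc.ents) + new_ents
    return doc


# Made-up sample sentence; the numbers are placeholders.
text = "Call Jane Doe, Head of Sales, on +44 20 7946 0958 or 555-010-9999."
phones = [text[s:e] for s, e in find_phone_spans(text)]
print(phones)
```

On a processed page this would be used as `doc = add_phone_entities(nlp(text))`, after which the phone numbers appear in `doc.ents` next to the model's own entities.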

Annotation tool:

Would need to annotate at least 5,000-6,000 records (i.e. web pages) to get meaningful results, though, and that means annotating all entities, not just job titles! 😅
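
Once pages are annotated, spaCy v3 trains from binary `.spacy` files built with `DocBin`. A rough sketch of that conversion follows, assuming the annotations can be exported as (text, [(start, end, label), ...]) records with character offsets. That record shape, the `JOB_TITLE` label, the helper names and the sample sentences are all assumptions, since they depend on whichever annotation tool ends up being used. The spaCy import is deferred so the split helper can be tried on its own.

```python
import random


def split_records(records, dev_fraction=0.2, seed=0):
    """Shuffle annotated records and split them into (train, dev) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_dev:], shuffled[:n_dev]


def to_docbin(records, path, lang="en"):
    """Write (text, [(start, end, label), ...]) records to a .spacy file."""
    import spacy  # deferred: only needed when actually writing training data
    from spacy.tokens import DocBin
    from spacy.util import filter_spans

    nlp = spacy.blank(lang)  # tokenizer only, no pretrained model required
    doc_bin = DocBin()
    for text, annotations in records:
        doc = nlp.make_doc(text)
        spans = [doc.char_span(s, e, label=label, alignment_mode="contract")
                 for s, e, label in annotations]
        # Drop offsets that do not align to tokens, then resolve overlaps.
        doc.ents = filter_spans([sp for sp in spans if sp is not None])
        doc_bin.add(doc)
    doc_bin.to_disk(path)


# Made-up records; every entity is annotated, not only the job titles.
records = [
    ("Jane Doe is Head of Sales at Acme.", [(12, 25, "JOB_TITLE"), (29, 33, "ORG")]),
    ("Contact our CTO in Berlin.", [(12, 15, "JOB_TITLE"), (19, 25, "GPE")]),
]
# Repeated here only so there is something to split.
train, dev = split_records(records * 5)
print(len(train), len(dev))
```

From there, `to_docbin(train, "train.spacy")` and `to_docbin(dev, "dev.spacy")` would write the two files, `python -m spacy init config config.cfg --lang en --pipeline ner` would generate a starter config, and `python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy` would run the training.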

YouTube series on training a model with spaCy:

Full series: https://www.youtube.com/watch?v=8HZ4BjWMod4&list=PL2VXyKi-KpYvuOdPwXR-FZfmZ0hjoNSUo
