04 Nov 2022
Library resources | |
---|---|
PyPI | --- |
GitHub | https://github.com/explosion/spaCy |
Homepage | https://spacy.io/ |
Documentation | https://spacy.io/usage/spacy-101 |
Getting Started
pip install -U pip setuptools wheel
pip install -U 'spacy[apple]'
python -m spacy download en_core_web_trf
Usage
Extracting entities (company and location) from an HTML page:
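import spacy
from bs4 import BeautifulSoup

file = 'xxxx.html'

# Strip the HTML markup and keep only the text content
with open(file) as f:
    soup = BeautifulSoup(f, 'html.parser')
text = soup.get_text()

# Load the English transformer pipeline downloaded in Getting Started
nlp = spacy.load("en_core_web_trf")
doc = nlp(text)

print("\nents:")
for ent in doc.ents:
    print(ent.text, ent.label_)
    # ORG and PRODUCT entities are treated as company candidates
    if ent.label_ == 'ORG' or ent.label_ == 'PRODUCT':
        print(f"{ent.text=}")
        print(f"{ent.label_=}")
    # GPE entities (countries, cities, states) are treated as locations
    if ent.label_ == 'GPE':
        print(f"{ent.text=}")
        print(f"{ent.label_=}")
    print()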
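The loop above only prints what it finds. A minimal sketch of collecting the results instead, assuming the same label mapping (ORG/PRODUCT as company, GPE as location). The function name `group_entities` and the `FakeEnt` stand-in are hypothetical and not part of spaCy; `FakeEnt` is only there so the snippet runs without loading a model.

```python
from collections import namedtuple

# Label mapping taken from the notes above
COMPANY_LABELS = {"ORG", "PRODUCT"}
LOCATION_LABELS = {"GPE"}


def group_entities(ents):
    """Collect unique entity texts into 'company' and 'location' lists.

    `ents` is any iterable of objects with `.text` and `.label_`
    attributes, such as `doc.ents`.
    """
    groups = {"company": [], "location": []}
    for ent in ents:
        # Collapse newlines and repeated spaces left over from the HTML text
        text = " ".join(ent.text.split())
        if ent.label_ in COMPANY_LABELS:
            bucket = groups["company"]
        elif ent.label_ in LOCATION_LABELS:
            bucket = groups["location"]
        else:
            continue  # ignore every other label
        if text and text not in bucket:
            bucket.append(text)
    return groups


# Hypothetical stand-in for spaCy spans; with a real pipeline, pass doc.ents
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])
sample = [
    FakeEnt("Explosion", "ORG"),
    FakeEnt("Berlin", "GPE"),
    FakeEnt("Explosion", "ORG"),
    FakeEnt("2022", "DATE"),
]
print(group_entities(sample))
```

With a loaded pipeline this would be called as `group_entities(doc.ents)`. Deduplicating on the cleaned text helps because web pages tend to repeat the same company name many times.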
Training my own model
30 Nov 2022
The standard models do not capture job titles or phone numbers.
I am looking into training a model for that.
Annotation tool:
I would need to annotate at least 5,000-6,000 records (i.e. web pages) to get significant results, though, and that means annotating all entities, not just job titles! 😅
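To get a feel for what one annotated record looks like: spaCy's NER training data is built from character-offset spans of the form `(start, end, label)`. A rough sketch of producing those offsets from plain substrings, using only the standard library. The helper `to_training_example` and the label names `JOB_TITLE` and `PHONE` are hypothetical, made up for this illustration, and the sample sentence is invented.

```python
def to_training_example(text, annotations):
    """Turn (substring, label) pairs into character-offset entity spans.

    Returns (text, {"entities": [(start, end, label), ...]}).
    Only the first occurrence of each substring is annotated;
    a substring that is not found raises ValueError.
    """
    entities = []
    for phrase, label in annotations:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        entities.append((start, start + len(phrase), label))
    entities.sort()  # order spans by their position in the text
    return text, {"entities": entities}


# JOB_TITLE and PHONE are hypothetical label names chosen for this sketch
text = "Contact Jane Doe, Head of Sales, on 555-0100."
example = to_training_example(
    text, [("Head of Sales", "JOB_TITLE"), ("555-0100", "PHONE")]
)
for start, end, label in example[1]["entities"]:
    print(label, repr(text[start:end]))
```

In spaCy v3 these offsets would then typically be turned into spans with `Doc.char_span` and saved with `DocBin` for `spacy train`. An annotation tool produces the same kind of offsets, just without the manual substring matching.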
YouTube series on training a model with spaCy:
Full series: https://www.youtube.com/watch?v=8HZ4BjWMod4&list=PL2VXyKi-KpYvuOdPwXR-FZfmZ0hjoNSUo