Open-source (work in progress) at:
29 Sep 2022 starting note.
Background
I deal a lot with URLs in my Python scripts, and it's always painful to piece together several functions I have accumulated over time to get what I need from a URL.
Goal
Fetch as much data as possible from a single URL input.
The aim here is to write an overarching function that does it all and that I can call easily from any script.
Logic
Input: URL
Output: return namedtuple with
- clean URL with path
- root website (ie without path)
- domain
- slug
- header
- title
- name
- summary
- tags
- emails
- email patterns
- phone
- twitter
- facebook
- youtube
- tiktok
- country(ies)
- logo
- whois data
Usage should be as follows:
from meta_url import meta
url = 'https://...'
x = meta(url)
print(x.title)
print(x.slug)
print(x.twitter)
etc...
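The return type can be declared up front as a namedtuple. The field names below are taken from the work-in-progress output further down; wishlist items like phone, email patterns and whois data are not in it yet:
from collections import namedtuple

# Fields match the work-in-progress output below; phone, email patterns
# and whois data from the wishlist above are not implemented yet.
metaURL = namedtuple('metaURL', [
    'clean_url', 'root_website_url', 'domain', 'slug',
    'header', 'title', 'name', 'description', 'tags',
    'emails', 'twitter', 'facebook', 'youtube', 'tiktok',
    'countries', 'logo',
])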
Components
Tests with:
test_url = 'https://www.amazon.co.uk/Great-Dune-Trilogy-Children-GOLLANCZ-ebook/dp/B07G17V69X/ref=sr_1_1?crid=3BQYT3B98L09M&keywords=dune+kindle&qid=1664442334&qu=eyJxc2MiOiIyLjI0IiwicXNhIjoiMS44OSIsInFzcCI6IjEuODkifQ%3D%3D&sprefix=dune+kindle%2Caps%2C91&sr=8-1'
imports
from urllib import request
from urllib.parse import urlparse
import tldextract
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome, ChromeOptions
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
clean URL with path
def clean_url(url: str) -> str:
    # urlparse is already imported at the top of this section
    purl = urlparse(url)
    scheme = purl.scheme + '://' if purl.scheme else ''
    return f'{scheme}{purl.netloc}{purl.path}'
outputs:
'https://www.amazon.co.uk/Great-Dune-Trilogy-Children-GOLLANCZ-ebook/dp/B07G17V69X/ref=sr_1_1'
root website (ie without path)
def root_url(url):
o = urlparse(url)
    root_website = f"{o.scheme}://{o.hostname}".lower()
return root_website
outputs:
'https://www.amazon.co.uk'
domain
def domain_from_url(url):
    # tldextract splits off the subdomain, so o.domain never contains 'www.'
    o = tldextract.extract(url)
    return f"{o.domain}.{o.suffix}".lower()
outputs:
'amazon.co.uk'
slug
def slug_from_url(url):
    # registered domain without the suffix, eg 'amazon' for amazon.co.uk
    return tldextract.extract(url).domain.lower()
outputs:
'amazon'
metadata
2 approaches here:
- with Selenium (which requires a browser driver)
- with simple request
Need to test results for both approaches. The perfect solution might be to use both, starting with one and falling back on the other (a wrapper sketch follows the Selenium section below).
with request
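A minimal sketch of the request-based variant, reusing the imports above and mirroring the output shape of the Selenium function below (untested; the name metadata_from_url_request is mine):
def metadata_from_url_request(url):
    # some sites block urllib's default user agent, so send a browser-like one
    req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with request.urlopen(req, timeout=10) as response:
        html = response.read()
    soup = BeautifulSoup(html, features='html.parser')
    metas = [x for x in soup.find_all('meta') if x.get('property')]
    return [{x.get('property'): x.get('content')} for x in metas]  # list of dicts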
with selenium
See Fetch website metadata from URL
def metadata_from_url_selenium(url, v=False):
    # pp (a pprint.PrettyPrinter) and get_linenumber() are debug helpers from the surrounding toolbox
    if v:
        print(f"\n#{get_linenumber()} starting metadata_from_url_selenium with {url}")
    chrome_options = ChromeOptions()
    chrome_options.add_argument('--headless')
    s = Service(driverpath)  # driverpath: path to the local chromedriver binary
    web = Chrome(service=s, options=chrome_options)
    web.get(url)
    xml = web.page_source
    if v:
        print(f"\n#{get_linenumber()} {xml=}")
    web.quit()
    soup = BeautifulSoup(xml, features='html.parser')
    if v:
        print(f"\n#{get_linenumber()} {soup=}")
    metas = [x for x in soup.find_all('meta') if x.get('property')]
    if v:
        print(f"\n#{get_linenumber()} metas:\n")
        pp.pprint(metas)
    return [{x.get('property'): x.get('content')} for x in metas]  # list of dicts
outputs eg:
[{'og:site_name': 'xxx'},
 {'og:title': 'xxx'},
 {'og:description': 'xxx'},
 {'og:image': 'https://xxx'},
 {'og:url': 'https://xxx'},
 {'og:locale': 'en_US'},
 {'og:type': 'website'},
 {'og:url': 'https://xxx'},
 {'og:site_name': 'xxx'}]
Note: output depends on the meta tags available on the website. Need to implement try/except when using.
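The start-with-one-fall-back-on-the-other idea from above could then be a thin wrapper (sketch, assuming the metadata_from_url_request sketch from earlier):
def metadata_from_url(url, v=False):
    # try the cheap request first; fall back to Selenium for JS-heavy sites
    try:
        metas = metadata_from_url_request(url)
        if metas:
            return metas
    except Exception:
        pass
    return metadata_from_url_selenium(url, v=v)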
title
name
summary
Find AI solution?
Fallback: use title / header.
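Pending an AI solution, the fallback could look like this, reusing the metas list and soup from the metadata step (sketch; the function name is mine):
def summary_from_metas(metas, soup):
    # prefer og:description if the site provides it
    for d in metas:
        if 'og:description' in d:
            return d['og:description']
    # fallback: the <title> tag, then the first header
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    h1 = soup.find('h1')
    return h1.get_text(strip=True) if h1 else ''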
tags
Find AI solution?
emails
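A naive first pass could be a regex over the page source (sketch; will miss obfuscated addresses and may catch false positives):
import re

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def emails_from_html(html):
    # deduplicate while preserving order of first appearance
    return list(dict.fromkeys(EMAIL_RE.findall(html)))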
countries
socials
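The social links could be collected from anchor tags pointing at the known platforms (sketch; names are mine):
SOCIAL_DOMAINS = ('twitter.com', 'facebook.com', 'youtube.com', 'tiktok.com')

def socials_from_soup(soup):
    # collect outbound links pointing at known social platforms
    links = {d: [] for d in SOCIAL_DOMAINS}
    for a in soup.find_all('a', href=True):
        for d in SOCIAL_DOMAINS:
            if d in a['href'] and a['href'] not in links[d]:
                links[d].append(a['href'])
    return links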
logo
Using Clearbit API:
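Clearbit's Logo API serves a logo image straight from a domain, so this part may be pure URL construction (sketch):
def logo_from_domain(domain):
    # Clearbit Logo API: returns the company logo image for the given domain
    return f"https://logo.clearbit.com/{domain}"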
Output
namedtuple object
+ structured log of the object (JSON)?
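If the JSON log happens, namedtuples convert cleanly via ._asdict() (sketch; the function name is mine):
import json

def log_meta(x):
    # namedtuple -> dict -> JSON string, eg for appending to a log file
    return json.dumps(x._asdict(), default=str)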
To explore
Work in progress
30 Sep 2022: currently works as follows:
from meta_url import meta
test = meta('https://notes.nicolasdeville.com')
print()
pp.pprint(test)
print()
print(f"{test.clean_url=}")
print(f"{test.root_website_url=}")
print(f"{test.domain=}")
print(f"{test.slug=}")
print(f"{test.header=}")
print(f"{test.title=}")
print(f"{test.name=}")
print(f"{test.description=}")
print(f"{test.tags=}")
print(f"{test.emails=}")
print(f"{test.twitter=}")
print(f"{test.facebook=}")
print(f"{test.youtube=}")
print(f"{test.tiktok=}")
print(f"{test.countries=}")
print(f"{test.logo=}")
outputs:
metaURL(clean_url='https://notes.nicolasdeville.com', root_website_url='https://notes.nicolasdeville.com', domain='nicolasdeville.com', slug='nicolasdeville', header='Building a static site with Pelican', title='', name='', description='', tags='', emails=[], twitter=['https://www.twitter.com/ndeville'], facebook=[], youtube=[], tiktok=[], countries=[], logo='')
test.clean_url='https://notes.nicolasdeville.com'
test.root_website_url='https://notes.nicolasdeville.com'
test.domain='nicolasdeville.com'
test.slug='nicolasdeville'
test.header='Building a static site with Pelican'
test.title=''
test.name=''
test.description=''
test.tags=''
test.emails=[]
test.twitter=['https://www.twitter.com/ndeville']
test.facebook=[]
test.youtube=[]
test.tiktok=[]
test.countries=[]
test.logo=''
Thinking about breaking it down so that each part of the output is a separate script within the package 🤔
04 Nov 2022
Exploring AI/NLP to extract company name & location with spaCy
Python library: spaCy
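A first experiment could use spaCy's named-entity recognizer, which tags organisations (ORG) and geopolitical locations (GPE) out of the box (sketch):
import spacy

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def org_and_locations(text):
    doc = nlp(text)
    orgs = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
    places = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
    return orgs, places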
Future
Though I don't have the skills for it, this could evolve into a free open-source version plus a paid-for API.
open-source
- free
- runs locally, ie requires driver download and scrapes from one's IP
- limited to publicly available data, eg domain's website
API
- paid-for, to cover both the build effort and the commercial licenses needed for database access
- runs in the cloud, ie no driver install nor scraping jobs from one's IP
- expand data & increase data quality with use of commercial databases
This would mean acting as a channel partner for existing paid-for API providers with company or website data.
data sources for paid-for API
Clearbit
Tuxx
- focus on NL companies only?