Python Open-Source Project: metaURL

generates metadata for a given URL

Open-source (work in progress) at:

29 Sep 2022 starting note.

Background

I deal a lot with URLs in my Python scripts, and it's always painful to piece together several functions I have accumulated over time to get what I need from a URL.

Goal

Fetch as much data as possible from a single URL input.

The aim here is to write an overarching function that does it all and that I can call easily from any script.

Logic

Input: URL

Output: return namedtuple with

  • clean URL with path
  • root website (ie without path)
  • domain
  • slug
  • header
  • title
  • name
  • summary
  • tags
  • emails
  • email patterns
  • phone
  • facebook
  • twitter
  • linkedin
  • youtube
  • tiktok
  • country(ies)
  • logo
  • whois data

Use should be as follows:

from meta_url import meta

url = 'https://...'

x = meta(url)

print(x.title)
print(x.slug)
print(x.twitter)
etc...

Components

Tests with:

test_url = 'https://www.amazon.co.uk/Great-Dune-Trilogy-Children-GOLLANCZ-ebook/dp/B07G17V69X/ref=sr_1_1?crid=3BQYT3B98L09M&keywords=dune+kindle&qid=1664442334&qu=eyJxc2MiOiIyLjI0IiwicXNhIjoiMS44OSIsInFzcCI6IjEuODkifQ%3D%3D&sprefix=dune+kindle%2Caps%2C91&sr=8-1'

imports

from urllib import request
from urllib.parse import urlparse
import pprint
import tldextract
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.chrome.service import Service

pp = pprint.PrettyPrinter(indent=4)  # pp.pprint is used for debug output below

clean URL with path

def clean_url(url: str) -> str:
    # returns scheme + netloc + path, dropping query string and fragment
    purl = urlparse(url)
    scheme = purl.scheme + '://' if purl.scheme else ''
    return f'{scheme}{purl.netloc}{purl.path}'

outputs:

'https://www.amazon.co.uk/Great-Dune-Trilogy-Children-GOLLANCZ-ebook/dp/B07G17V69X/ref=sr_1_1'

root website (ie without path)

def root_url(url):
    o = urlparse(url)
    root_website = f"{o.scheme}://{o.hostname}".lower()
    return root_website

outputs:

'https://www.amazon.co.uk'

domain

def domain_from_url(url):
    o = tldextract.extract(url)
    # tldextract already splits any 'www.' into o.subdomain
    return f"{o.domain}.{o.suffix}".lower()

outputs:

'amazon.co.uk'

slug

def slug_from_url(url):
    o = tldextract.extract(url)
    # o.domain never contains 'www.' (that ends up in o.subdomain)
    return o.domain.lower()

outputs:

'amazon'

metadata

Two approaches here:

  • with Selenium (which requires a browser driver)
  • with simple request

Need to test results for both approaches; the perfect solution might be to use both, starting with one and falling back on the other, as sketched below.
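A possible shape for that fallback (names are assumptions; both functions are sketched in the subsections that follow):

def metadata_from_url(url, v=False):
    # Try the lightweight request approach first; fall back to Selenium
    # when the request fails or finds no meta tags (eg JS-rendered pages).
    try:
        metas = metadata_from_url_request(url, v=v)
        if metas:
            return metas
    except Exception as e:
        if v:
            print(f"request approach failed ({e}), falling back to Selenium")
    return metadata_from_url_selenium(url, v=v)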

with request
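Not written yet; a minimal sketch, assuming a desktop User-Agent header to get past basic bot checks:

def metadata_from_url_request(url, v=False):
    req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with request.urlopen(req, timeout=10) as response:
        html = response.read()
    soup = BeautifulSoup(html, features='html.parser')
    metas = [x for x in soup.find_all('meta') if x.get('property')]
    if v:
        pp.pprint(metas)
    return [{x.get('property'): x.get('content')} for x in metas]  # same shape as the Selenium version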

with selenium

See !python/fetch-metadata-from-url

def metadata_from_url_selenium(url, v=False):
    # driverpath (path to the ChromeDriver binary) and get_linenumber()
    # (a debug helper) are defined elsewhere in the package
    if v:
        print(f"\n#{get_linenumber()} starting metadata_from_url_selenium with {url}")

    chrome_options = ChromeOptions()
    chrome_options.add_argument('--headless')

    s = Service(driverpath)
    web = Chrome(service=s, options=chrome_options)
    web.get(url)
    xml = web.page_source
    if v:
        print(f"\n#{get_linenumber()} {xml=}")
    web.quit()
    soup = BeautifulSoup(xml, features='html.parser')
    if v:
        print(f"\n#{get_linenumber()} {soup=}")
    metas = [x for x in soup.find_all('meta') if x.get('property')]
    if v:
        print(f"\n#{get_linenumber()} metas:\n")
        pp.pprint(metas)

    return [{x.get('property'): x.get('content')} for x in metas] # list of dicts

outputs eg:

[   {'og:site_name': 'xxx'},
    {'og:title': 'xxx'},
    {'og:description': 'xxx'},
    {'og:image': 'https://xxx'},
    {'og:url': 'https://xxx'},
    {'og:locale': 'en_US'},
    {'og:type': 'website'},
    {'og:url': 'https://xxx'},
    {'og:site_name': 'xxx'}]

Note: the output depends on the meta tags available on each website, so missing keys need to be handled (try/except or defaults) when using it.
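Rather than try/except everywhere, one option is to flatten the list of dicts once and read properties with defaults; a sketch:

def og_value(metas, prop, default=''):
    # Flatten the list of single-key dicts, keeping the first occurrence
    # of duplicates like og:url, and return the value for prop (or default).
    flat = {}
    for d in metas:
        for k, v in d.items():
            flat.setdefault(k, v)
    return flat.get(prop, default)

# usage: title = og_value(metadata_from_url_selenium(url), 'og:title')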

title

name

summary

Find AI solution?
Fallback: use title / header.
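A sketch of that fallback chain for both title and summary, reusing the og_value helper above and assuming the page's soup is already available:

def title_and_summary(soup, metas):
    # Prefer og: tags; fall back to the <title> tag and the
    # name="description" meta tag when they are missing.
    title = og_value(metas, 'og:title')
    if not title and soup.title and soup.title.string:
        title = soup.title.string.strip()
    summary = og_value(metas, 'og:description')
    if not summary:
        tag = soup.find('meta', attrs={'name': 'description'})
        if tag and tag.get('content'):
            summary = tag['content'].strip()
    return title, summary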

tags

Find AI solution?

emails
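Nothing here yet; a possible first pass is a regex over the raw page source (the pattern below is a deliberately simple assumption and will miss edge cases):

import re

def emails_from_html(html):
    # Return unique, sorted email addresses found in the page source.
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    return sorted(set(re.findall(pattern, html)))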

countries

socials

Using Clearbit API:
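Independent of Clearbit, a first scraping-based pass could filter social links out of the page's anchors. A minimal sketch (the domain list is an assumption):

SOCIAL_DOMAINS = ('twitter.com', 'facebook.com', 'linkedin.com',
                  'youtube.com', 'tiktok.com')

def socials_from_soup(soup):
    # Map each social domain to the unique profile links found on the page.
    found = {domain: [] for domain in SOCIAL_DOMAINS}
    for a in soup.find_all('a', href=True):
        href = a['href']
        for domain in SOCIAL_DOMAINS:
            if domain in href and href not in found[domain]:
                found[domain].append(href)
    return found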

Output

namedtuple object

+ structured log of the object (JSON)?
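A sketch of the definition, with field names taken from the 30 Sep run output below (the plan above lists a few more, eg linkedin, phone and whois data, still to be added):

from collections import namedtuple

metaURL = namedtuple('metaURL', [
    'clean_url', 'root_website_url', 'domain', 'slug',
    'header', 'title', 'name', 'description', 'tags',
    'emails', 'twitter', 'facebook', 'youtube', 'tiktok',
    'countries', 'logo',
])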

To explore

Work in progress

30 Sep 2022: currently works as follows:

from meta_url import meta

test = meta('https://notes.nicolasdeville.com')

print()
pp.pprint(test)
print() 
print(f"{test.clean_url=}")
print(f"{test.root_website_url=}")
print(f"{test.domain=}")
print(f"{test.slug=}")
print(f"{test.header=}")
print(f"{test.title=}")
print(f"{test.name=}")
print(f"{test.description=}")
print(f"{test.tags=}")
print(f"{test.emails=}")
print(f"{test.twitter=}")
print(f"{test.facebook=}")
print(f"{test.youtube=}")
print(f"{test.tiktok=}")
print(f"{test.countries=}")
print(f"{test.logo=}")

outputs:

metaURL(clean_url='https://notes.nicolasdeville.com', root_website_url='https://notes.nicolasdeville.com', domain='nicolasdeville.com', slug='nicolasdeville', header='Building a static site with Pelican', title='', name='', description='', tags='', emails=[], twitter=['https://www.twitter.com/ndeville'], facebook=[], youtube=[], tiktok=[], countries=[], logo='')

test.clean_url='https://notes.nicolasdeville.com'
test.root_website_url='https://notes.nicolasdeville.com'
test.domain='nicolasdeville.com'
test.slug='nicolasdeville'
test.header='Building a static site with Pelican'
test.title=''
test.name=''
test.description=''
test.tags=''
test.emails=[]
test.twitter=['https://www.twitter.com/ndeville']
test.facebook=[]
test.youtube=[]
test.tiktok=[]
test.countries=[]
test.logo=''

Thinking about breaking it down so that each part of the output is a separate script within the package 🤔

04 Nov 2022

Exploring AI/NLP to extract company names & locations with spaCy: !python/library-spacy
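A minimal sketch of the spaCy idea, assuming the en_core_web_sm model is installed (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load('en_core_web_sm')

def org_and_locations(text):
    # spaCy's NER tags company names as ORG and countries/cities as GPE.
    doc = nlp(text)
    orgs = sorted({ent.text for ent in doc.ents if ent.label_ == 'ORG'})
    locations = sorted({ent.text for ent in doc.ents if ent.label_ == 'GPE'})
    return orgs, locations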

Future

Though I don't have the skills for it yet, this could evolve into a free open-source package plus a paid-for API.

open-source

  • free
  • runs locally, ie requires driver download and scrapes from one's IP
  • limited to publicly available data, eg domain's website

API

  • paid-for, both to fund the build and to cover the commercial licenses needed for database access
  • runs in the cloud, ie no driver install nor scraping jobs from one's IP
  • expanded data & higher data quality through the use of commercial databases

This would amount to acting as a channel partner for existing paid-for API providers of company or website data.

data sources for paid-for API

Clearbit

Tuxx

  • focus on NL companies only?

CUFinder
