Open-source (work in progress) at:
29 Sep 2022 starting note.
Background
I deal a lot with URLs in my Python scripts, and it's always painful to piece together several functions I have accumulated over time to get what I need from a URL.
Goal
Fetch as much data as possible from a single URL input.
The aim here is to write an overarching function that does it all and that I can call easily from any script.
Logic
Input: URL
Output: return namedtuple with
- clean URL with path
- root website (ie without path)
- domain
- slug
- header
- title
- name
- summary
- tags
- emails
- email patterns
- phone
- twitter
- facebook
- youtube
- tiktok
- country(ies)
- logo
- whois data
Usage should be as follows:
from meta_url import meta
url = 'https://...'
x = meta(url)
print(x.title)
print(x.slug)
print(x.twitter)
etc...
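The return type can be declared up front as a namedtuple. The field names below are taken from the work-in-progress output further down; wishlist items like phone, email patterns and whois data are not in it yet:
from collections import namedtuple

# Fields match the work-in-progress output below; phone, email patterns
# and whois data from the wishlist above are not implemented yet.
metaURL = namedtuple('metaURL', [
    'clean_url', 'root_website_url', 'domain', 'slug',
    'header', 'title', 'name', 'description', 'tags',
    'emails', 'twitter', 'facebook', 'youtube', 'tiktok',
    'countries', 'logo',
])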
Components
Tests with:
test_url = 'https://www.amazon.co.uk/Great-Dune-Trilogy-Children-GOLLANCZ-ebook/dp/B07G17V69X/ref=sr_1_1?crid=3BQYT3B98L09M&keywords=dune+kindle&qid=1664442334&qu=eyJxc2MiOiIyLjI0IiwicXNhIjoiMS44OSIsInFzcCI6IjEuODkifQ%3D%3D&sprefix=dune+kindle%2Caps%2C91&sr=8-1'
imports
from urllib import request
from urllib.parse import urlparse
import tldextract
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome, ChromeOptions
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
clean URL with path
def clean_url(url: str) -> str:
    # urlparse is already imported at the top of this section
    purl = urlparse(url)
    scheme = purl.scheme + '://' if purl.scheme else ''
    return f'{scheme}{purl.netloc}{purl.path}'
outputs:
'https://www.amazon.co.uk/Great-Dune-Trilogy-Children-GOLLANCZ-ebook/dp/B07G17V69X/ref=sr_1_1'
root website (ie without path)
def root_url(url):
o = urlparse(url)
    root_website = f"{o.scheme}://{o.hostname}".lower()
return root_website
outputs:
'https://www.amazon.co.uk'
domain
def domain_from_url(url):
    # tldextract splits off the subdomain, so o.domain never contains 'www.'
    o = tldextract.extract(url)
    return f"{o.domain}.{o.suffix}".lower()
outputs:
'amazon.co.uk'
slug
def slug_from_url(url):
    # registered domain without the suffix, eg 'amazon' for amazon.co.uk
    return tldextract.extract(url).domain.lower()
outputs:
'amazon'
metadata
2 approaches here:
- with Selenium (which requires a browser driver)
- with simple request
Need to test results for both approaches. The perfect solution might be to use both, starting with one and falling back on the other (a wrapper sketch follows the Selenium section below).
with request
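A minimal sketch of the request-based variant, reusing the imports above and mirroring the output shape of the Selenium function below (untested; the name metadata_from_url_request is mine):
def metadata_from_url_request(url):
    # some sites block urllib's default user agent, so send a browser-like one
    req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with request.urlopen(req, timeout=10) as response:
        html = response.read()
    soup = BeautifulSoup(html, features='html.parser')
    metas = [x for x in soup.find_all('meta') if x.get('property')]
    return [{x.get('property'): x.get('content')} for x in metas]  # list of dicts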
with selenium
See Fetch website metadata from URL
def metadata_from_url_selenium(url, v=False):
    # pp (a pprint.PrettyPrinter) and get_linenumber() are debug helpers from the surrounding toolbox
    if v:
        print(f"\n#{get_linenumber()} starting metadata_from_url_selenium with {url}")
    chrome_options = ChromeOptions()
    chrome_options.add_argument('--headless')
    s = Service(driverpath)  # driverpath: path to the local chromedriver binary
    web = Chrome(service=s, options=chrome_options)
    web.get(url)
    xml = web.page_source
    if v:
        print(f"\n#{get_linenumber()} {xml=}")
    web.quit()
    soup = BeautifulSoup(xml, features='html.parser')
    if v:
        print(f"\n#{get_linenumber()} {soup=}")
    metas = [x for x in soup.find_all('meta') if x.get('property')]
    if v:
        print(f"\n#{get_linenumber()} metas:\n")
        pp.pprint(metas)
    return [{x.get('property'): x.get('content')} for x in metas]  # list of dicts
outputs eg:
[{'og:site_name': 'xxx'},
 {'og:title': 'xxx'},
 {'og:description': 'xxx'},
 {'og:image': 'https://xxx'},
 {'og:url': 'https://xxx'},
 {'og:locale': 'en_US'},
 {'og:type': 'website'},
 {'og:url': 'https://xxx'},
 {'og:site_name': 'xxx'}]
Note: output depends on the meta tags available on the website. Need to implement try/except when using.
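The start-with-one-fall-back-on-the-other idea from above could then be a thin wrapper (sketch, assuming the metadata_from_url_request sketch from earlier):
def metadata_from_url(url, v=False):
    # try the cheap request first; fall back to Selenium for JS-heavy sites
    try:
        metas = metadata_from_url_request(url)
        if metas:
            return metas
    except Exception:
        pass
    return metadata_from_url_selenium(url, v=v)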
title
name
summary
Find AI solution?
Fallback: use title / header.
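Pending an AI solution, the fallback could look like this, reusing the metas list and soup from the metadata step (sketch; the function name is mine):
def summary_from_metas(metas, soup):
    # prefer og:description if the site provides it
    for d in metas:
        if 'og:description' in d:
            return d['og:description']
    # fallback: the <title> tag, then the first header
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    h1 = soup.find('h1')
    return h1.get_text(strip=True) if h1 else ''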
tags
Find AI solution?
emails
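A naive first pass could be a regex over the page source (sketch; will miss obfuscated addresses and may catch false positives):
import re

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def emails_from_html(html):
    # deduplicate while preserving order of first appearance
    return list(dict.fromkeys(EMAIL_RE.findall(html)))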
countries
socials
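The social links could be collected from anchor tags pointing at the known platforms (sketch; names are mine):
SOCIAL_DOMAINS = ('twitter.com', 'facebook.com', 'youtube.com', 'tiktok.com')

def socials_from_soup(soup):
    # collect outbound links pointing at known social platforms
    links = {d: [] for d in SOCIAL_DOMAINS}
    for a in soup.find_all('a', href=True):
        for d in SOCIAL_DOMAINS:
            if d in a['href'] and a['href'] not in links[d]:
                links[d].append(a['href'])
    return links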
logo
Using Clearbit API:
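Clearbit's Logo API serves a logo image straight from a domain, so this part may be pure URL construction (sketch):
def logo_from_domain(domain):
    # Clearbit Logo API: returns the company logo image for the given domain
    return f"https://logo.clearbit.com/{domain}"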
Output
namedtuple object
+ structured log of the object (JSON)?
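If the JSON log happens, namedtuples convert cleanly via ._asdict() (sketch; the function name is mine):
import json

def log_meta(x):
    # namedtuple -> dict -> JSON string, eg for appending to a log file
    return json.dumps(x._asdict(), default=str)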
To explore
Work in progress
30 Sep 2022: currently works as follows:
from meta_url import meta
test = meta('https://notes.nicolasdeville.com')
print()
pp.pprint(test)
print()
print(f"{test.clean_url=}")
print(f"{test.root_website_url=}")
print(f"{test.domain=}")
print(f"{test.slug=}")
print(f"{test.header=}")
print(f"{test.title=}")
print(f"{test.name=}")
print(f"{test.description=}")
print(f"{test.tags=}")
print(f"{test.emails=}")
print(f"{test.twitter=}")
print(f"{test.facebook=}")
print(f"{test.youtube=}")
print(f"{test.tiktok=}")
print(f"{test.countries=}")
print(f"{test.logo=}")
outputs:
metaURL(clean_url='https://notes.nicolasdeville.com', root_website_url='https://notes.nicolasdeville.com', domain='nicolasdeville.com', slug='nicolasdeville', header='Building a static site with Pelican', title='', name='', description='', tags='', emails=[], twitter=['https://www.twitter.com/ndeville'], facebook=[], youtube=[], tiktok=[], countries=[], logo='')
test.clean_url='https://notes.nicolasdeville.com'
test.root_website_url='https://notes.nicolasdeville.com'
test.domain='nicolasdeville.com'
test.slug='nicolasdeville'
test.header='Building a static site with Pelican'
test.title=''
test.name=''
test.description=''
test.tags=''
test.emails=[]
test.twitter=['https://www.twitter.com/ndeville']
test.facebook=[]
test.youtube=[]
test.tiktok=[]
test.countries=[]
test.logo=''
Thinking about breaking it down so that each part of the output is a separate script within the package 🤔
04 Nov 2022
Exploring AI/NLP to extract company name & location with spaCy
Python library: spaCy
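A first experiment could use spaCy's named-entity recognizer, which tags organisations (ORG) and geopolitical locations (GPE) out of the box (sketch):
import spacy

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def org_and_locations(text):
    doc = nlp(text)
    orgs = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
    places = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
    return orgs, places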
Future
Though I don't have the skills for it, this could evolve into a free open-source version plus a paid-for API.
open-source
- free
- runs locally, ie requires driver download and scrapes from one's IP
- limited to publicly available data, eg domain's website
API
- paid-for, to cover both the build effort and the commercial licenses needed for database access
- runs in the cloud, ie no driver install nor scraping jobs from one's IP
- expand data & increase data quality with use of commercial databases
This would mean acting as a channel partner for existing paid-for API providers with company or website data.
data sources for paid-for API
Clearbit
Tuxx
- focus on NL companies only?