metaURL

Python tool to extract metadata, contacts, and social profiles from any website URL.

12 Apr 2026. An old personal project I need to resume at some point.

Overview

metaURL is a Python-based web scraping tool that pulls structured metadata from any website. Give it a URL and it returns a namedtuple with 27 fields covering page metadata, contact info, social profiles, company details, and more. It is built with Selenium for headless browsing, BeautifulSoup for HTML parsing, and spaCy for named-entity recognition.

It was developed as part of BtoBSales.EU to enrich lead data from websites automatically.

How it works

The core meta() function takes a URL and runs a multi-step pipeline:

  1. Selenium opens the page in a headless browser (handles JS-rendered content).
  2. BeautifulSoup parses the HTML to extract meta tags, headings, links.
  3. spaCy runs NER to identify people and organisations in the page text.
  4. Fuzzy matching compares detected names against the domain to find the company name.
  5. Contact pages, social links, and email patterns are discovered by crawling internal links.
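
Step 4 can be sketched with the standard library alone. The snippet below is a minimal illustration, not the project's actual code: the source does not say which fuzzy-matching library metaURL uses, so difflib stands in here, and the helper names (domain_label, normalise, best_company_name) and the 0.6 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import List, Optional
from urllib.parse import urlparse


def domain_label(url: str) -> str:
    """Return the main label of a URL's host, e.g. 'acme' for https://www.acme.com/."""
    host = urlparse(url).hostname or ""
    parts = [p for p in host.split(".") if p and p != "www"]
    # Naive choice: the label before the last one. Multi-part suffixes such as
    # .co.uk are not handled in this sketch.
    if len(parts) >= 2:
        return parts[-2]
    return parts[0] if parts else ""


def normalise(name: str) -> str:
    """Lowercase and keep only letters and digits, so 'Acme Corp.' becomes 'acmecorp'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def best_company_name(
    url: str, candidates: List[str], threshold: float = 0.6
) -> Optional[str]:
    """Pick the candidate organisation name most similar to the domain label."""
    label = domain_label(url)
    scored = [
        (SequenceMatcher(None, normalise(c), label).ratio(), c) for c in candidates
    ]
    if not scored:
        return None
    score, name = max(scored)
    # Reject weak matches (hypothetical cut-off) rather than guess a company name.
    return name if score >= threshold else None


# Candidates would come from the NER step; these are made-up examples.
print(best_company_name("https://www.acme.com/about", ["Google Analytics", "Acme", "John Smith"]))
```

Normalising both sides before comparing keeps punctuation and spacing in names like "Acme Corp." from lowering the score against a domain label that has neither.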

Data fields

The returned namedtuple includes 27 fields:

Category    Fields
Metadata    title, description, h1, keywords, logo
Contact     emails, phone, contact_pages, email_patterns
Company     company name, team members, countries
Social      Facebook, LinkedIn, Twitter, TikTok, Medium, YouTube
Technical   internal links, downloadable files, domain info, WHOIS
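
To show how a namedtuple result of this kind is typically consumed, here is a small sketch with a handful of the fields listed above. The type name SiteMeta, the exact attribute spellings (company_name, linkedin), and the default values are assumptions for illustration; the real tuple has 27 fields and may name them differently.

```python
from typing import NamedTuple, Optional


class SiteMeta(NamedTuple):
    """Illustrative subset of the result; not the project's real definition."""

    title: Optional[str] = None
    description: Optional[str] = None
    h1: Optional[str] = None
    emails: tuple = ()
    phone: Optional[str] = None
    contact_pages: tuple = ()
    company_name: Optional[str] = None  # hypothetical attribute spelling
    linkedin: Optional[str] = None      # hypothetical attribute spelling


# A made-up result, standing in for what a scrape might return.
result = SiteMeta(title="Example Domain", emails=("info@example.com",))

# Fields are read by attribute, and _asdict() gives a plain mapping for export.
print(result.title)
print(result._asdict()["emails"])

# List the fields the scrape could not fill in.
missing = [name for name, value in result._asdict().items() if value in (None, ())]
print(missing)
```

Giving every field a default means a partial scrape still yields a complete tuple, which is convenient when rows are later written to a CSV or database.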

Dependencies

pip install selenium beautifulsoup4 spacy
python -m spacy download en_core_web_sm

Selenium requires a browser driver (ChromeDriver or GeckoDriver) on the system PATH.
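
Once the driver is in place, a quick way to check the setup is to render one page and read two basic fields from it. The function below is a rough sketch of the first two pipeline steps using standard Selenium and BeautifulSoup calls; its name, the choice of Chrome, and the 30-second timeout are assumptions, and it is not the project's meta() function.

```python
def fetch_title_and_description(url: str, timeout: int = 30) -> dict:
    """Render a page in headless Chrome and pull two basic meta fields."""
    # Imported lazily so this module can be loaded without the browser stack.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # headless flag for recent Chrome versions
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(timeout)
        driver.get(url)
        html = driver.page_source  # HTML after JavaScript has run
    finally:
        driver.quit()  # always release the browser process

    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "description": tag.get("content") if tag else None,
    }
```

Calling fetch_title_and_description("https://example.com") should return a small dict if the browser and driver are installed correctly; a WebDriverException at webdriver.Chrome(...) usually points to a missing or mismatched driver.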

Known limitations

  • Country detection is unreliable. Jurisdictions mentioned on legal and privacy pages can be mistaken for the company's actual headquarters location.
  • Email pattern detection needs refinement for edge cases.
  • Work in progress. The README flags several areas under active development.
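
To make the email pattern limitation concrete: the source does not describe how patterns are detected, so the sketch below assumes the common approach of matching an address's local part against templates built from a known person's name. The regex, the template list, and the function name infer_pattern are all illustrative assumptions.

```python
import re
from typing import Optional

# A pragmatic email regex for scanning page text; it is not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def infer_pattern(email: str, first: str, last: str) -> Optional[str]:
    """Describe how the local part of an email is built from a person's name."""
    local = email.split("@", 1)[0].lower()
    first, last = first.lower(), last.lower()
    templates = {
        "{first}.{last}": f"{first}.{last}",
        "{first}{last}": f"{first}{last}",
        "{f}{last}": f"{first[:1]}{last}",
        "{f}.{last}": f"{first[:1]}.{last}",
        "{first}": first,
        "{last}": last,
    }
    for template, rendered in templates.items():
        if local == rendered:
            return template
    # No template matched: accents, middle names, and role mailboxes
    # (info@, sales@) are the kind of edge cases that end up here.
    return None


found = EMAIL_RE.findall("Contact jane.doe@acme.com or info@acme.com.")
print(found)
print(infer_pattern("jane.doe@acme.com", "Jane", "Doe"))
print(infer_pattern("info@acme.com", "Jane", "Doe"))
```

Exact template matching is deliberately strict, which is also why it misses the edge cases: any address that deviates slightly from a template is reported as having no pattern.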

⚠️ WARNING: Headless browser scraping can be slow and memory-intensive. It is not suited to scraping thousands of URLs in quick succession without connection pooling or rate limiting.
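
One simple form of rate limiting is to enforce a minimum gap between the start of consecutive fetches. The generator below is a minimal sketch under that assumption: throttled is a hypothetical helper, the 2-second default is arbitrary, and fetch stands for whatever callable does the scraping (for example the project's meta() function).

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def throttled(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    min_interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, object]]:
    """Call fetch(url) for each URL, starting calls at least min_interval seconds apart."""
    last_start: Optional[float] = None
    for url in urls:
        if last_start is not None:
            # Time still owed since the previous call started.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield url, fetch(url)


# Demo with a stand-in fetch function and a short interval.
for url, result in throttled(["https://example.com", "https://example.org"], len, min_interval=0.1):
    print(url, result)
```

The clock and sleep parameters exist only so the timing can be replaced in tests. For large batches it would also help to reuse one browser instance across URLs instead of launching a new one per page.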
