Linkedinee

a LinkedIn data extractor toolbox.

15 Jan 2023

Background

As I'm trying to find my next gig, I turn to Linkedin to help me network and find the right opportunity.

I have had a Linkedin Sales Navigator account for a while, each time paid for by the company I was working for.

Emerging from my sabbatical, I have subscribed to Linkedin Premium (the Job Seeker version).

In theory this means I can access all the data I need to find the right opportunity.

But in practice, the UX & limitations (300 max profile visits per day? 500?) make it a challenge to find opportunities & keep track of my job-search activities.

Ultimately what I need - now, or in the future for Sales activities - is a way to process the data on my own terms (ie in my own systems) rather than through Linkedin's cumbersome, imposed UI/UX.

I have played with Linkedin automation tools in the past, but got flagged by Linkedin and had to stop.
There are many such tools these days, but they all share the same issues: they violate Linkedin's terms of service, and they can be expensive.

The reason Linkedin's terms prohibit automated data extraction is twofold: the data is their asset (though it should not be - it's the users' data), and they need Linkedin pages to actually be visited (ie displayed on screen) so they can show the ads they make money on.

I believe that should not be the case considering I'm paying for the service, but that's another story.

API

Linkedin's API is gated (it requires a developer account subject to their review) and the data it exposes is very limited.

So I decided to build my own solution.

Goal

My solution is based on ensuring pages are visited "as normal" (on my screen, using a browser not controlled by an automation framework), so that Linkedin still gets its ad revenue.

Note also that I usually visit Linkedin manually with the Brave browser !apps/brave , which blocks ads.

Here I will automate visiting the Linkedin pages as if it were me, using Chrome (ie with ads displayed), and automate the data extraction.

Ultimately what I want is a function that takes a company website URL and outputs a list of employees (or defined key people, eg the CEO) to my data repository, where I can manipulate the data at ease, eg:

  • filter by any criteria
  • keep track of my progress (eg job search or sales prospecting)

At the same time, visiting those people's profiles acts as a "soft touch" (ie they see my profile in their "Who viewed your profile" list).

Linkedin gets its ad revenue + my subscription fees, and I use the data only for my own needs, so in line with their terms.
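
In code terms, the end goal is a function shaped roughly like this (a sketch only; the steps are assembled from the building blocks in the rest of this post):

def company_to_people(company_website_url):
    """Company website in -> key people rows out, stored in my data repository."""
    # 1. resolve the company's Linkedin page (hypothetical lookup step)
    # 2. open it in Chrome -> the extension below saves the page HTML
    # 3. parse the saved file into people dicts (see Extractee below)
    # 4. push the rows to Grist, then filter / track at ease
    ...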

Risks

The risk is getting flagged by Linkedin, resulting in account restrictions or even account suspension.

However, considering the following, I hope the risks are limited:

  • I'm not automating the visiting of pages through a bot clicking on links
  • I will stay under the daily visits limit (see below)
  • I'm paying for a subscription + letting their ads display (double revenue for Linkedin), and I actually see them
  • I'm not using the data for spam, only for targeted outreach (that's the whole point of the automation; precise targeting is otherwise tedious manual work)
  • Connects & InMails are sent manually (the manual/high-touch time investment goes where it matters, instead of at the data stage)

This project requires some tech-savviness & Python knowledge; it is not for everyone.

Limits

Visits

Rule of thumb: don't view more pages than a professional (or VA) could visit manually in a day.

  • 300 max profile visits per day? 500?

Connects

No more than 3–5% of your total LinkedIn connections (https://support.dux-soup.com/article/45-how-many-connection-requests-to-send).
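
A quick way to turn that into a daily cap, per the dux-soup guidance (a sketch; the 4% default is just the middle of that range):

def daily_connect_budget(total_connections, rate=0.04):
    """Cap daily connection requests at ~3-5% of total connections."""
    return max(1, int(total_connections * rate))

# eg 1,000 connections -> at most 40 connection requests per day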

Structure

  • Chrome extension to download pages as HTML files upon visit (ie page load)
  • Python script to visit a list of pages (ie profiles), both Company and People
  • Python script to extract data from the HTML files

My data repository is Grist !apps/grist.
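
For context, grist_BB as used by the scripts below is essentially one API client per Grist doc. A minimal sketch using the grist_api Python package (doc IDs elided; the API key is read from the GRIST_API_KEY environment variable or ~/.grist-api-key):

# grist_BB.py (sketch - the real module may differ)
from grist_api import GristDocAPI

SERVER = 'https://docs.getgrist.com'  # or a self-hosted Grist server

# One client per Grist doc; each exposes fetch_table / add_records / update_records
Startups = GristDocAPI('<startups-doc-id>', server=SERVER)
VCs = GristDocAPI('<vcs-doc-id>', server=SERVER)
Recruiters = GristDocAPI('<recruiters-doc-id>', server=SERVER)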

Chrome extension (download pages as HTML files)

See !helpers/chrome-extension

A Chrome extension is simply a folder containing, at a minimum:
- manifest.json
- background.js
(plus the icon.png referenced from the manifest below)

background.js

for Linkedin Company profiles

chrome.runtime.onMessage.addListener(function(request, sender, sendResponse) {
    if (request.message == "download") {
        // Encode the page HTML as a data: URL and derive a file name
        // from the page URL (non-alphanumerics replaced by underscores)
        const dataUrl = 'data:text/html,' + encodeURIComponent(request.html);
        const url = new URL(request.urlo);
        const filename = url.href.replace(url.origin, "").replace(/[^a-zA-Z0-9]/g, '_');
        chrome.downloads.download({
            'url': dataUrl,
            'filename': "Linkedin/companies/" + filename + ".html",
            'saveAs': false
        });
    }
});

chrome.tabs.onUpdated.addListener(function(tabId, changeInfo, tab) {
    if (tab.status == 'complete' && tab.url.search("chrome://") < 0) {
        const urlo = tab.url;
        // Only act on Linkedin Company pages
        if (urlo.search("linkedin.com/company/") >= 0) {
            chrome.scripting.executeScript({ target: { tabId }, func: runner, args: [urlo] });
        }
    }

    // Injected into the page: wait 4s for the content to render, then send
    // the HTML back to the service worker. `exist` is intentionally a
    // page-global, guarding against sending more than once per page.
    function runner(urlo) {
        if (typeof(exist) == "undefined") {
            setTimeout(function() {
                chrome.runtime.sendMessage({ "message": "download", "html": document.body.innerHTML, urlo: urlo });
            }, 4000);
            exist = true;
        }
    }
});

for Linkedin People profiles

I use a second Chrome extension for People profiles, with the same background.js as above, changing only two lines:

A different save folder inside my Downloads folder: 'filename': "Linkedin/people/"+filename+".html",

A different URL pattern to decide which pages to save: if(urlo.search("linkedin.com/in/")>=0){

manifest.json

for Linkedin Company profiles

{
    "name": "Save Linkedin Companies",
    "version": "3.0",
    "manifest_version": 3,
    "description": "automatically saves visited Linkedin Company profiles to a given folder as HTML files",
    "icons": {
        "16": "icon.png",
        "32": "icon.png",
        "48": "icon.png",
        "128": "icon.png"
    },
    "background": {
        "service_worker": "background.js"
    },
    "host_permissions": ["<all_urls>"],
    "permissions": [
        "downloads",
        "scripting",
        "tabs"
    ]
}

for Linkedin People profiles

Same as above, just changing the name. The actual difference lives in the background.js file loaded by each extension.

TODO - figure out how to have them both in a single extension (not critical at this stage).

Python scripts (visit profiles & extract data)

Visitee

####################
# Visit Linkedin profiles

test = False
v = False # verbose mode

### Script-specific imports

import time
import webbrowser
import random
import visited
from my_utils import linkedin_handle_from_url, linkedin_url_from_handle
import add_to_Grist_People
import add_to_Grist_Companies
# Note: grist_BB (see the sketch above) and ts_db (timestamp string used for
# DB writes) are defined in a shared header common to my scripts, not shown here.

### Global Variables

DB = 'RECRUITERS' # 'STARTUPS' or 'VCs' or 'RECRUITERS'
VISIT = 'PERSON' # 'COMPANY' or 'PERSON'

count_to_visit = random.randint(69, 93)
if test:
    count_to_visit = 2
print(f"\ncount_to_visit: {count_to_visit}")

wait_from = 13.5
wait_to = 27.8

count_visited = 0

count_visited_today = visited.people_today()

### Functions

# linkedin_handle_from_url & linkedin_url_from_handle were moved to my_utils;
# see the Extractee script below for the implementation of the former.

### Main

print(f"\nLoading data for {DB}...")
if DB == 'STARTUPS':
    data_companies = grist_BB.Startups.fetch_table('Master')
    data_people = grist_BB.Startups.fetch_table('Linkedins')
elif DB == 'VCs':
    data_companies = grist_BB.VCs.fetch_table('Master')
    data_people = grist_BB.VCs.fetch_table('Linkedins')
elif DB == 'RECRUITERS':
    data_companies = grist_BB.Recruiters.fetch_table('Master')
    data_people = grist_BB.Recruiters.fetch_table('Linkedins')
print(f"{len(data_companies)} companies")
print(f"{len(data_people)} people")

print(f"\nStarting Visitee {VISIT} {DB}...")

if VISIT == 'PERSON':

    data_companies_to_visit = [
        x.id for x in data_companies
        # if x.yes == True
        # if 'VE_Providers' in x.notes
        # if 'US' in x.gristHelper_Display2
        if not x.discard
        ] 
    data_linkedins_to_visit = {
        x.linkedin:x.id for x in data_people
        if x.domain in data_companies_to_visit
        and x.linkedin not in [None, '']
        and not x.downloaded
        and not x.discard
        # and x.PI == True
        }
    to_do = len(data_linkedins_to_visit)
    if to_do < count_to_visit:
        count_to_visit = to_do

    print(f"\n\n{to_do} PERSON Linkedin profiles to visit from {DB}.\n\n")

    ### Main

    if to_do > 0:

        for linkedin,id in data_linkedins_to_visit.items():

            if count_visited < count_to_visit:

                count_visited += 1
                count_visited_today += 1

                to_do = to_do - 1

                print(f"\n{count_visited}/{count_to_visit} ({to_do} total left / {count_visited_today} visited today) {linkedin}")

                update_data = [{   
                                    'id': id,
                                    'visited': ts_db,
                                    'downloaded': True,
                                    }]

                if not test:

                    webbrowser.get('chrome').open_new_tab(linkedin)

                    if DB == 'STARTUPS':
                        grist_BB.Startups.update_records('Linkedins', update_data)
                    elif DB == 'VCs':
                        grist_BB.VCs.update_records('Linkedins', update_data)
                    elif DB == 'RECRUITERS':
                        grist_BB.Recruiters.update_records('Linkedins', update_data)

                    wait = random.uniform(wait_from, wait_to)
                    print(f"waiting {round(wait, 1)} seconds")
                    for w in range(1,int(wait) + 1):
                        print("\r" + str(w), end='')
                        time.sleep(1)

                print()

        print(f"\nLEFT TO DO: {to_do}")

        ### ADD TO GRIST

        add_to_Grist_People.process()

        ### CHECK VISITED

        if not test:
            visited.people_today()

if VISIT == 'COMPANY':

    data_companies_to_visit = {
        x.linkedin:x.id for x in data_companies
        if x.linkedin not in [None, '']
        and not '/showcase/' in x.linkedin
        # and x.yes == True
        # and not x.ni
        and not x.downloaded
        # and 'VE_Providers' in x.notes
        }
    to_do = len(data_companies_to_visit)
    if to_do < count_to_visit:
        count_to_visit = to_do

    print(f"\n\n{to_do} COMPANY Linkedin profiles to visit from {DB}.\n\n")

    if to_do > 0:

        ### Main

        for linkedin_url,id in data_companies_to_visit.items():
            if v:
                print(f"\n{linkedin_url=}")

            if count_visited < count_to_visit:

                count_visited += 1
                count_visited_today += 1

                to_do = to_do - 1

                # print(f"\n{count_visited}/{count_to_visit} ({to_do} total left / {count_visited_today} visited today) {linkedin_url}")

                linkedin_handle = linkedin_handle_from_url(linkedin_url)
                if v:
                    print(f"{linkedin_handle=}")

                if DB == 'STARTUPS':
                    linkedin = f"https://www.linkedin.com/company/{linkedin_handle}/people/?keywords=ceo"
                elif DB == 'RECRUITERS':
                    linkedin = f"https://www.linkedin.com/company/{linkedin_handle}/people/"
                elif DB == 'VCs':
                    linkedin = f"https://www.linkedin.com/company/{linkedin_handle}/people/?keywords=partner"

                print(f"\n{count_visited}/{count_to_visit} ({to_do} total left) {linkedin}")

                update_data = [{   
                                'id': id,
                                'downloaded': True,
                                }]

                if not test:

                    webbrowser.get('chrome').open_new_tab(linkedin)

                    if DB == 'STARTUPS':
                        grist_BB.Startups.update_records('Master', update_data)
                    elif DB == 'VCs':
                        grist_BB.VCs.update_records('Master', update_data)
                    elif DB == 'RECRUITERS':
                        grist_BB.Recruiters.update_records('Master', update_data)

                    wait = random.uniform(wait_from, wait_to)
                    print(f"waiting {round(wait, 1)} seconds")
                    for w in range(1,int(wait) + 1):
                        print("\r" + str(w), end='')
                        time.sleep(1)

                print()

        ### ADD TO GRIST

        # First Companies
        add_to_Grist_Companies.process()
        # Then People extracted from those companies
        add_to_Grist_People.process()

        ### CHECK VISITED

        if not test:
            visited.people_today()
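
For reference, visited.py is not shown here; its people_today() can be as simple as counting the profile files downloaded today (a sketch under that assumption - the real helper may differ):

# visited.py (sketch): count profile HTML files downloaded today,
# including those already moved to the processed/ subfolder
import os
from datetime import date, datetime

def people_today(folder='/Users/xxx/Downloads/Linkedin/profiles'):
    count = 0
    for root, dirs, files in os.walk(folder):
        for f in files:
            if not f.endswith('.html'):
                continue
            mtime = datetime.fromtimestamp(os.path.getmtime(os.path.join(root, f)))
            if mtime.date() == date.today():
                count += 1
    print(f"{count} people visited today")
    return count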

Extractee

Companies

####################
# Add scraped data from Linkedin COMPANY pages to Grist

### Script-specific imports

import os
from extract_data import company_profile
from tqdm import tqdm
import shutil
import webbrowser
from random import randint
# Note: test, v (verbose), pp (pprint), ts_db and grist_BB are defined in a
# shared header common to my scripts, not shown here.

### Functions

def fetch_file_paths(folder_path):
    global test
    file_paths = []
    for root, dirs, files in os.walk(folder_path):
        if root == folder_path: # remove to also crawl subfolders
            for file in files:
                file_path = os.path.join(root, file)
                file_paths.append(file_path)
    if test: 
        file_paths = file_paths[:10]
    return file_paths

def linkedin_handle_from_url(url, v=False):
    handle = url # fallback, in case no known pattern matches below
    # People LinkedIn URLs
    if 'linkedin.com/in/' in url:
        handle = url.split('linkedin.com/in/')[1]
    # Company LinkedIn URLs
    elif 'linkedin.com/company/' in url:
        handle = url.split('linkedin.com/company/')[1]
    elif 'linkedin.com/showcase/' in url:
        handle = url.split('linkedin.com/showcase/')[1]
    # Process handle
    if '?' in handle:
        handle = handle.split('?')[0]
    if '/' in handle:
        handle = handle.replace('/', '')
    return handle

def linkedin_link_from_file_path(file_path, v=False):
    if v:
        print(f"{file_path=}")
    # Recreate the LinkedIn URL based on file name
    linkedin_link = file_path.replace('/Users/xxx/Downloads/Linkedin/companies/_company_', 'https://www.linkedin.com/company/')
    if '_people__keywords_ceo.html' in linkedin_link:
        linkedin_link = linkedin_link.replace('_people__keywords_ceo.html', '')
    if '_people__keywords_partner.html' in linkedin_link:
        linkedin_link = linkedin_link.replace('_people__keywords_partner.html', '')
    if '_people__keywords_saas.html' in linkedin_link:
        linkedin_link = linkedin_link.replace('_people__keywords_saas.html', '')
    if '_people_.html' in linkedin_link:
        linkedin_link = linkedin_link.replace('_people_.html', '')
    # TODO make it more generic to capture any keyword
    linkedin_link = linkedin_link.replace('.html', '')
    if linkedin_link.endswith('--'):
        linkedin_link = linkedin_link[ : -2 ]
    if linkedin_link.endswith('-'):
        linkedin_link = linkedin_link[ : -1 ]
    # if v:
    #     print(f'\n{loc} func(linkedin_link_from_file_path) #{ln()}: {linkedin_link=}')
    return linkedin_link
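
# A more generic alternative to the hard-coded suffixes above (a sketch,
# addressing the TODO): one regex strips '_people_' plus any '_keywords_<kw>'
# tail, whatever the keyword.
import re

def strip_people_suffix(link):
    """Remove the '/people/?keywords=<kw>' residue encoded in the file name."""
    return re.sub(r'_people_(?:_keywords_[A-Za-z0-9]+)?\.html$', '', link)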

### Main

#### COMPANIES

def process():

    count = 0
    processed_files = []
    error_files = []
    error_handles = []

    count_total = 0
    count_matches = 0

    company_categories_found = set()

    lc_vcs_linkedins = []
    lc_recruiters_linkedins = []
    lc_startups_linkedins = []

    print(f"\n============ STARTING add_to_Grist_Companies\n")

    companies_folder_path = '/Users/xxx/Downloads/Linkedin/companies'

    #### Dicts of Grist COMPANY Linkedin handles (per Grist doc), as UID

    grist_recruiters_companies = {linkedin_handle_from_url(x.linkedin):int(x.id) for x in grist_BB.Recruiters.fetch_table('Master') if x.linkedin not in [None, '']}
    print(f"Fetched {len(grist_recruiters_companies)} Recruiters Company profiles from Grist.")
    grist_vcs_companies = {linkedin_handle_from_url(x.linkedin):int(x.id) for x in grist_BB.VCs.fetch_table('Master') if x.linkedin not in [None, '']}
    print(f"Fetched {len(grist_vcs_companies)} VCs Company profiles from Grist.")
    grist_startups_companies = {linkedin_handle_from_url(x.linkedin):int(x.id) for x in grist_BB.Startups.fetch_table('Master') if x.linkedin not in [None, '']}
    print(f"Fetched {len(grist_startups_companies)} Startups Company profiles from Grist.\n")

    #### Dicts of Grist PERSON Profiles (per Grist doc), as UID

    grist_recruiters_linkedins = {
                                    linkedin_handle_from_url(x.linkedin):int(x.id)
                                    for x in grist_BB.Recruiters.fetch_table('Linkedins')
                                    if x.linkedin not in [None, '']
                                    }
    print(f"Fetched {len(grist_recruiters_linkedins)} Recruiters Linkedin profiles from Grist.")

    grist_vcs_linkedins = {
                            linkedin_handle_from_url(x.linkedin):int(x.id)
                            for x in grist_BB.VCs.fetch_table('Linkedins')
                            if x.linkedin not in [None, '']
                            }
    print(f"Fetched {len(grist_vcs_linkedins)} VCs Linkedin profiles from Grist.")

    grist_startups_linkedins = {
                            linkedin_handle_from_url(x.linkedin):int(x.id)
                            for x in grist_BB.Startups.fetch_table('Linkedins')
                            if x.linkedin not in [None, '']
                            }
    print(f"Fetched {len(grist_startups_linkedins)} Startups Linkedin profiles from Grist.\n")

    #### Dict of Scraped profiles

    # def process_companies_files(): # TODO make function so as to be called again later with errors

    file_paths = fetch_file_paths(companies_folder_path)

    dict_scraped_vcs = {}
    dict_scraped_recruiters = {}
    dict_scraped_startups = {}

    for file_path in tqdm(file_paths):
        count_total += 1
        # print(f"{file_path=}")
        if file_path.endswith((".html")):
            if '(' not in file_path:
                if '_people_' in file_path:
                    count_matches += 1
                    if v:
                        print(f"\n\n#######################\n{count_matches}\n{file_path=}")

                    linkedin_link = linkedin_link_from_file_path(file_path, v=v)

                    linkedin_handle = linkedin_handle_from_url(linkedin_link, v=v)

                    try:
                        # Scrape the LinkedIn profile with extract_data.py
                        company_profile_data = company_profile(file_path, v=v)

                        # Categorisation
                        category = company_profile_data['category']
                        company_categories_found.add(category) # for overview of all categories found
                        if 'Recruiting' in category:
                            classify_as = 'Recruiter'
                        elif 'Venture Capital' in category:
                            classify_as = 'VC'
                        else:
                            classify_as = 'Startup'

                        # Add to dicts per Grist Doc with Linkedin handle as key
                        if classify_as == 'Recruiter':
                            dict_scraped_recruiters[linkedin_handle] = company_profile_data
                        elif classify_as == 'VC':
                            dict_scraped_vcs[linkedin_handle] = company_profile_data
                        elif classify_as == 'Startup':
                            dict_scraped_startups[linkedin_handle] = company_profile_data

                        processed_files.append(file_path)

                    except Exception as e:
                        print(f"Error with handle {linkedin_handle} ({e}): removing file {file_path}")
                        error_handles.append(linkedin_handle)
                        os.remove(file_path)
                else:
                    os.remove(file_path)

            else:
                os.remove(file_path)


    #### Process to Grist

    print(f"\n\nProcessing scraped data:\n{len(dict_scraped_recruiters)} Recruiters\n{len(dict_scraped_vcs)} VCs\n{len(dict_scraped_startups)} Startups")

    # processed_startups = []

    # print()
    # pp.pprint(dict_scraped_startups)
    # print()

    ### STARTUPS
    for linkedin_company_handle, data in dict_scraped_startups.items():
        if v:
            print(f"\n{linkedin_company_handle=}")
            print(f"\ndata={pp.pprint(data)}")
        file_path = data['file_path']

        # PEOPLE DATA
        people = data['people']
        for l,p in people.items():
            if v:
                print(f"\n{l=}")
                print(f"\n{p=}")
            linkedin_handle = linkedin_handle_from_url(p['linkedin_link'])
            if '%' not in linkedin_handle:

                connection = p['connection']
                new_linkedin_link = f"https://www.linkedin.com/in/{linkedin_handle}"
                name = p['name']
                title = p['title']

                person_data = {   
                            'connection': connection,
                            'linkedin': new_linkedin_link,
                            'full_name': name,
                            'headline': title,
                            'src': f"https://www.linkedin.com/company/{linkedin_company_handle}",
                            'created': ts_db,
                        }

                ## PREPARE TO ADD TO GRIST

                if v:
                    print(f"\n{linkedin_handle=}\n")

                ### Startups
                if linkedin_company_handle in grist_startups_companies:
                    if linkedin_handle not in grist_startups_linkedins:
                        # Get the Foreign Key for the Master record to associate People with
                        fk = grist_startups_companies[linkedin_company_handle] # int x.id
                        # Update person_data object with Foreign Key
                        person_data['domain'] = fk
                        # Append to Startups list to create in Grist
                        lc_startups_linkedins.append(person_data)
                        print(f"Added {name} / {new_linkedin_link} to Startups")

            # COMPANY DATA
            # TODO goal: update Master record with data from Company Linkedin profile

            processed_files.append(file_path)

    ### RECRUITERS
    for linkedin_company_handle, data in dict_scraped_recruiters.items():
        if v:
            print(f"\n{linkedin_company_handle=}")
            print(f"\ndata={pp.pprint(data)}")
        file_path = data['file_path']

        # PEOPLE DATA
        people = data['people']
        for l,p in people.items():
            if v:
                print(f"\n{l=}")
                print(f"\n{p=}")
            linkedin_handle = linkedin_handle_from_url(p['linkedin_link'])
            if '%' not in linkedin_handle:

                connection = p['connection']
                new_linkedin_link = f"https://www.linkedin.com/in/{linkedin_handle}"
                name = p['name']
                title = p['title']

                person_data = {   
                            'connection': connection,
                            'linkedin': new_linkedin_link,
                            'full_name': name,
                            'headline': title,
                            'src': f"https://www.linkedin.com/company/{linkedin_company_handle}",
                            'created': ts_db,
                        }

                ## PREPARE TO ADD TO GRIST

                if v:
                    print(f"\n{linkedin_handle=}\n")

                if linkedin_company_handle in grist_recruiters_companies:
                    if linkedin_handle not in grist_recruiters_linkedins:
                        fk = grist_recruiters_companies[linkedin_company_handle] # int x.id
                        person_data['domain'] = fk
                        lc_recruiters_linkedins.append(person_data)
                        print(f"Added {name} / {new_linkedin_link} to recruiters")

            # COMPANY DATA
            # TODO goal: update Master record with data from Company Linkedin profile

            processed_files.append(file_path)


    ### VCs
    for linkedin_company_handle, data in dict_scraped_vcs.items():
        if v:
            print(f"\n{linkedin_company_handle=}")
            print(f"\ndata={pp.pprint(data)}")
        file_path = data['file_path']

        # PEOPLE DATA
        people = data['people']
        for l,p in people.items():
            if v:
                print(f"\n{l=}")
                print(f"\n{p=}")
            linkedin_handle = linkedin_handle_from_url(p['linkedin_link'])
            if '%' not in linkedin_handle:

                connection = p['connection']
                new_linkedin_link = f"https://www.linkedin.com/in/{linkedin_handle}"
                name = p['name']
                title = p['title']

                person_data = {   
                            'connection': connection,
                            'linkedin': new_linkedin_link,
                            'full_name': name,
                            'headline': title,
                            'src': f"https://www.linkedin.com/company/{linkedin_company_handle}",
                            'created': ts_db,
                        }

                ## PREPARE TO ADD TO GRIST

                if v:
                    print(f"\n{linkedin_handle=}\n")

                if linkedin_company_handle in grist_vcs_companies:
                    if linkedin_handle not in grist_vcs_linkedins:
                        fk = grist_vcs_companies[linkedin_company_handle] # int x.id
                        person_data['domain'] = fk
                        lc_vcs_linkedins.append(person_data)
                        print(f"Added {name} / {new_linkedin_link} to vcs")

            # COMPANY DATA
            # TODO goal: update Master record with data from Company Linkedin profile

            processed_files.append(file_path)


    # ADD TO GRIST
    if test:
        print(f"\n\nTEST:")
        print(f"{len(lc_startups_linkedins)} Startups Linkedins to Grist:")
        print(f"{len(lc_vcs_linkedins)} VCs Linkedins to Grist:")
        print(f"{len(lc_recruiters_linkedins)} Recruiters Linkedins to Grist:")
    else:
        if len(lc_startups_linkedins) > 0:
            grist_BB.Startups.add_records('Linkedins', lc_startups_linkedins)
            print(f"\n\nADDED {len(lc_startups_linkedins)} Startups Linkedins profiles to Grist")
        if len(lc_vcs_linkedins) > 0:
            grist_BB.VCs.add_records('Linkedins', lc_vcs_linkedins)
            print(f"ADDED {len(lc_vcs_linkedins)} VCs Linkedins profiles to Grist")
        if len(lc_recruiters_linkedins) > 0:
            grist_BB.Recruiters.add_records('Linkedins', lc_recruiters_linkedins)
            print(f"ADDED {len(lc_recruiters_linkedins)} Recruiters Linkedins profiles to Grist")

    # MOVE PROCESSED FILES
    print()
    if not test:
        for pf in sorted(set(processed_files)):
            file_name = os.path.basename(pf)
            # print(f"{pf=}")
            # print(f"{file_name=}")
            if os.path.exists(f'/Users/xxx/Downloads/Linkedin/companies/processed/{file_name}'):
                os.remove(f'/Users/xxx/Downloads/Linkedin/companies/processed/{file_name}')
            shutil.move(pf, '/Users/xxx/Downloads/Linkedin/companies/processed')
        print(f"\n\n============ Moved {len(processed_files)} scraped files to processed folder.\n")

    # CATEGORIES OF COMPANIES PROCESSED (for classification tweaking/troubleshooting)
    print(f"\nFYI - COMPANY CATEGORIES FOUND:")
    pp.pprint(company_categories_found)
    print()

    if len(error_handles) > 0:
        # print(f"\n{len(error_handles)} ERROR WITH FILES - opening them now on People section...")
        print(f"\n{len(error_handles)} ERROR WITH FILES:")
        for handle in error_handles:
            link_to_open = f"https://www.linkedin.com/company/{handle}/people/"
            print(f"{link_to_open}")
            # # 230113 Stopped visiting automatically as most are dead links
            # if handle.strip() not in ['unavailable','404']:

            #     print(f"Opening {link_to_open}")
            #     webbrowser.get('chrome').open_new_tab(link_to_open)

            #     wait = randint(12, 29)

            #     print(f"waiting {wait} seconds")

            #     for w in range(1,wait + 1):
            #         # print(w, end=" ", flush=True)
            #         print("\r" + str(w), end='')
            #         time.sleep(1)
            #     print()

    print(f"\n\n{count_total=}")
    print(f"{count_matches=}")
    # print(f"{count=}")


    # TODO manage errors by visiting standard Company link & doing a second pass (make function first)
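
The three near-identical loops above (Startups / Recruiters / VCs) could collapse into a single helper - a sketch, using the same field names as above (ts_db comes from the shared header):

def collect_people(dict_scraped, grist_companies, grist_linkedins, label):
    """Return (people to create in Grist, file paths processed) for one category."""
    to_create, files = [], []
    for company_handle, data in dict_scraped.items():
        for l, p in data['people'].items():
            handle = linkedin_handle_from_url(p['linkedin_link'])
            if '%' in handle: # skip URL-encoded handles, as above
                continue
            if company_handle in grist_companies and handle not in grist_linkedins:
                to_create.append({
                    'connection': p['connection'],
                    'linkedin': f"https://www.linkedin.com/in/{handle}",
                    'full_name': p['name'],
                    'headline': p['title'],
                    'src': f"https://www.linkedin.com/company/{company_handle}",
                    'created': ts_db,
                    'domain': grist_companies[company_handle], # FK to the Master record
                })
                print(f"Added {p['name']} to {label}")
        files.append(data['file_path']) # once per company is enough
    return to_create, files

# eg: lc_startups_linkedins, fps = collect_people(
#     dict_scraped_startups, grist_startups_companies, grist_startups_linkedins, 'Startups')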

People

### Script-specific imports

import os
from extract_data import linkedin_profile, company_profile
from tqdm import tqdm
import shutil
# Note: test, v (verbose) and grist_BB are defined in a shared header common
# to my scripts, not shown here.

### Global Variables

# RUN = 'PEOPLE' # 'PEOPLE' or 'COMPANIES' or 'ALL'

folder_path = '/Users/xxx/Downloads/Linkedin/profiles'

count = 0 # files processed (used as a global in process())

missing_experiences = []

processed_files = []

### Functions

def fetch_file_paths(folder_path):
    global test
    file_paths = []
    for root, dirs, files in os.walk(folder_path):
        if root == folder_path: # remove to also crawl subfolders
            for file in files:
                file_path = os.path.join(root, file)
                file_paths.append(file_path)
    if test: 
        file_paths = file_paths[:10]
    return file_paths

def linkedin_handle_from_url(url):
    handle = url # fallback, in case no known pattern matches below
    # People LinkedIn URLs
    if 'linkedin.com/in/' in url:
        handle = url.split('linkedin.com/in/')[1]
    # Company LinkedIn URLs
    elif 'linkedin.com/company/' in url:
        handle = url.split('linkedin.com/company/')[1]
    elif 'linkedin.com/showcase/' in url:
        handle = url.split('linkedin.com/showcase/')[1]
    # Process handle
    if '?' in handle:
        handle = handle.split('?')[0]
    if '/' in handle:
        handle = handle.replace('/', '')
    return handle

### Main

#### PEOPLE

def process():
    global count

    print(f"\n============ STARTING add_to_Grist_People\n")

    # Dicts of Grist profiles (per Grist doc)

    grist_recruiters_linkedins = {}
    for x in grist_BB.Recruiters.fetch_table('Linkedins'):
        if x.linkedin in [None, '']:
            continue
        linkedin = x.linkedin.rstrip('/')
        grist_recruiters_linkedins[linkedin] = int(x.id)
    print(f"Fetched {len(grist_recruiters_linkedins)} Recruiters Linkedin profiles from Grist.")

    grist_vcs_linkedins = {}
    for x in grist_BB.VCs.fetch_table('Linkedins'):
        if x.linkedin in [None, '']:
            continue
        linkedin = x.linkedin.rstrip('/')
        grist_vcs_linkedins[linkedin] = int(x.id)
    print(f"Fetched {len(grist_vcs_linkedins)} VCs Linkedin profiles from Grist.")

    grist_startups_linkedins = {}
    for x in grist_BB.Startups.fetch_table('Linkedins'):
        if x.linkedin in [None, '']:
            continue
        linkedin = x.linkedin.rstrip('/')
        grist_startups_linkedins[linkedin] = int(x.id)
    print(f"Fetched {len(grist_startups_linkedins)} Startups Linkedin profiles from Grist.\n")

    # Dict of Scraped profiles

    file_paths = fetch_file_paths(folder_path)

    dict_scraped = {}

    for file_path in tqdm(file_paths):
        if file_path.endswith((".html")):
            if '(' not in file_path:
                count += 1
                if v:
                    print(f"\n\n#######################\n{count}\n{file_path=}")

                # Recreate the LinkedIn URL based on file name
                linkedin_link = file_path.replace('/Users/xxx/Downloads/Linkedin/profiles/_in_', 'https://www.linkedin.com/in/')
                if 'originalSubdomain' in linkedin_link:
                    linkedin_link = linkedin_link.split('originalSubdomain')[0]
                if '_' in linkedin_link:
                    linkedin_link = linkedin_link.replace('_', '-')
                linkedin_link = linkedin_link.replace('-.html', '')
                if linkedin_link.endswith('--'):
                    linkedin_link = linkedin_link[ : -2 ]
                if linkedin_link.endswith('-'):
                    linkedin_link = linkedin_link[ : -1 ]
                if v:
                    print(f'\n{linkedin_link=}')

                # Scrape the LinkedIn profile with extract_data.py
                profile_data = linkedin_profile(linkedin_link,file_path, v=v)

                dict_scraped[linkedin_link] = profile_data

                # Keep track of profiles with no Experience identified because no Company logo on scraped page
                if profile_data['count_experiences'] == 0:
                    missing_experiences.append(linkedin_link)
                    # os.remove(file_path)
            else:
                os.remove(file_path)

    #### UPDATE GRIST

    ## Recruiters

    lu_recruiters = []

    for linkedin_link, id in grist_recruiters_linkedins.items():
        if linkedin_link in dict_scraped:

            lu_recruiters.append(
                    {   'id': id,
                        'full_name': dict_scraped[linkedin_link]['name'],
                        'connection': dict_scraped[linkedin_link]['connection'],
                        'headline': dict_scraped[linkedin_link]['headline'],
                        'location': dict_scraped[linkedin_link]['location'],
                        'connections': dict_scraped[linkedin_link]['connections'],
                        'status': dict_scraped[linkedin_link]['connection_status'],

                        'comp1_linkedin': dict_scraped[linkedin_link]['company_linkedin1'],
                        'comp1_name': dict_scraped[linkedin_link]['company_name1'],
                        'comp1_title': dict_scraped[linkedin_link]['title1'],
                        'comp1_dates': dict_scraped[linkedin_link]['company_dates1'],
                        'comp2_linkedin': dict_scraped[linkedin_link]['company_linkedin2'],
                        'comp2_name': dict_scraped[linkedin_link]['company_name2'],
                        'comp2_title': dict_scraped[linkedin_link]['title2'],
                        'comp2_dates': dict_scraped[linkedin_link]['company_dates2'],
                        'comp3_linkedin': dict_scraped[linkedin_link]['company_linkedin3'],
                        'comp3_name': dict_scraped[linkedin_link]['company_name3'],
                        'comp3_title': dict_scraped[linkedin_link]['title3'],
                        'comp3_dates': dict_scraped[linkedin_link]['company_dates3'],
                        }
            )

            processed_files.append(dict_scraped[linkedin_link]['file_path'])

    grist_BB.Recruiters.update_records('Linkedins', lu_recruiters)

    print(f"\nUPDATED {len(lu_recruiters)} RECRUITERS LINKEDINS in GRIST.")


    ## VCs

    lu_vcs = []

    for linkedin_link, id in grist_vcs_linkedins.items():
        if linkedin_link in dict_scraped:

            lu_vcs.append(
                    {   'id': id,
                        'full_name': dict_scraped[linkedin_link]['name'],
                        'connection': dict_scraped[linkedin_link]['connection'],
                        'headline': dict_scraped[linkedin_link]['headline'],
                        'location': dict_scraped[linkedin_link]['location'],
                        'connections': dict_scraped[linkedin_link]['connections'],
                        'status': dict_scraped[linkedin_link]['connection_status'],

                        'comp1_linkedin': dict_scraped[linkedin_link]['company_linkedin1'],
                        'comp1_name': dict_scraped[linkedin_link]['company_name1'],
                        'comp1_title': dict_scraped[linkedin_link]['title1'],
                        'comp1_dates': dict_scraped[linkedin_link]['company_dates1'],
                        'comp2_linkedin': dict_scraped[linkedin_link]['company_linkedin2'],
                        'comp2_name': dict_scraped[linkedin_link]['company_name2'],
                        'comp2_title': dict_scraped[linkedin_link]['title2'],
                        'comp2_dates': dict_scraped[linkedin_link]['company_dates2'],
                        'comp3_linkedin': dict_scraped[linkedin_link]['company_linkedin3'],
                        'comp3_name': dict_scraped[linkedin_link]['company_name3'],
                        'comp3_title': dict_scraped[linkedin_link]['title3'],
                        'comp3_dates': dict_scraped[linkedin_link]['company_dates3'],
                        }
            )

            processed_files.append(dict_scraped[linkedin_link]['file_path'])

    grist_BB.VCs.update_records('Linkedins', lu_vcs)

    print(f"\nUPDATED {len(lu_vcs)} VCs LINKEDINS in GRIST.")


    ## Startups

    lu_startups = []

    for linkedin_link, id in grist_startups_linkedins.items():
        if linkedin_link in dict_scraped:

            lu_startups.append(
                    {   'id': id,
                        'full_name': dict_scraped[linkedin_link]['name'],
                        'connection': dict_scraped[linkedin_link]['connection'],
                        'headline': dict_scraped[linkedin_link]['headline'],
                        'location': dict_scraped[linkedin_link]['location'],
                        'connections': dict_scraped[linkedin_link]['connections'],
                        'status': dict_scraped[linkedin_link]['connection_status'],

                        'comp1_linkedin': dict_scraped[linkedin_link]['company_linkedin1'],
                        'comp1_name': dict_scraped[linkedin_link]['company_name1'],
                        'comp1_title': dict_scraped[linkedin_link]['title1'],
                        'comp1_dates': dict_scraped[linkedin_link]['company_dates1'],
                        'comp2_linkedin': dict_scraped[linkedin_link]['company_linkedin2'],
                        'comp2_name': dict_scraped[linkedin_link]['company_name2'],
                        'comp2_title': dict_scraped[linkedin_link]['title2'],
                        'comp2_dates': dict_scraped[linkedin_link]['company_dates2'],
                        'comp3_linkedin': dict_scraped[linkedin_link]['company_linkedin3'],
                        'comp3_name': dict_scraped[linkedin_link]['company_name3'],
                        'comp3_title': dict_scraped[linkedin_link]['title3'],
                        'comp3_dates': dict_scraped[linkedin_link]['company_dates3'],
                        }
            )

            processed_files.append(dict_scraped[linkedin_link]['file_path'])

    grist_BB.Startups.update_records('Linkedins', lu_startups)

    print(f"\nUPDATED {len(lu_startups)} STARTUP LINKEDINS VCs in GRIST.")

    if len(missing_experiences) > 0:
        print(f"\n\nFILES WITH MISSING EXPERIENCES (no Experience section could be extracted):")
        for m in missing_experiences:
            print(m)

    if len(processed_files) > 0:
        for pf in sorted(set(processed_files)):
            file_name = os.path.basename(pf)
            if os.path.exists(f'/Users/xxx/Downloads/Linkedin/profiles/processed/{file_name}'):
                os.remove(f'/Users/xxx/Downloads/Linkedin/profiles/processed/{file_name}')
            shutil.move(pf, '/Users/xxx/Downloads/Linkedin/profiles/processed')
        print(f"\n\nMOVED {len(processed_files)} scraped files to processed folder.\n")

    # TODO: 
    # - get list of LinkedIn URLs for "People also viewed" section
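
Similarly, the repeated comp1/comp2/comp3 blocks above could be generated instead of spelled out three times - a sketch:

def person_update(id, d):
    """Build one Grist update dict from a scraped profile dict d."""
    u = {
        'id': id,
        'full_name': d['name'],
        'connection': d['connection'],
        'headline': d['headline'],
        'location': d['location'],
        'connections': d['connections'],
        'status': d['connection_status'],
    }
    for i in (1, 2, 3):
        u[f'comp{i}_linkedin'] = d[f'company_linkedin{i}']
        u[f'comp{i}_name'] = d[f'company_name{i}']
        u[f'comp{i}_title'] = d[f'title{i}']
        u[f'comp{i}_dates'] = d[f'company_dates{i}']
    return u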

TODO

Get Also Viewed

15 Jan 2023: DONE for already-processed files.
- add logic to run upon single-file processing

Get Company data from Company pages

DONE as a one-off script for all processed files.
- add logic to run upon single-file processing
- update records in Grist where needed

Make add_to_Grist_Companies a function

  • returns number of employees found.
  • triggered by each company profile visit.

Process each file after visit

  • defines the wait time before the next visit based on the number of employees processed (with the function above; sketch below)
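
A sketch of that wait-time function (the base / per-employee values are illustrative):

import random

def wait_time(employees_processed, base=13.5, per_employee=1.5):
    """Seconds before the next visit: jittered base, longer when more was processed."""
    return random.uniform(base, base * 2) + per_employee * employees_processed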
