15 Jan 2023
Background
As I look for my next gig, I'm turning to Linkedin to help me network and find the right opportunity.
I have had a Linkedin Sales Navigator account for a while, each time paid for by the company I was working for.
Emerging from my sabbatical, I have subscribed to Linkedin Premium (the Job Seeker version).
In theory this means I can access all the data I need to find the right opportunity.
In practice, though, the UX and the limitations (300 max profile visits per day? 500?) make it a challenge to run and keep track of my job search activities.
Ultimately what I need - now, or in the future for Sales activities - is a way to process the data on my own terms (ie in my own systems) instead of having Linkedin's cumbersome UI/UX imposed on me.
I have played with Linkedin automation tools in the past, but got flagged by Linkedin and had to stop.
There are many such tools these days, but they all share the same issues: they violate Linkedin's terms of service, and they can be expensive.
Linkedin's terms prohibit automated data extraction because the data is their asset (though it should not be - it's the users' data) and because they need Linkedin pages to be visited (ie displayed on screen) to show the ads they make money on.
I believe that should not be the case considering I'm paying for the service, but that's another story.
API
Linkedin's API is gated (it requires a developer account subject to their review) and the data it exposes is very limited.
So I decided to build my own solution.
Goal
My solution is based on ensuring pages are visited "as normal" (on my screen, in a browser that is not automation-controlled) so that Linkedin gets its ads revenues.
Note that I usually browse Linkedin manually with Brave, which blocks ads.
Here I will automate visiting the Linkedin pages as if it were me, using Chrome (ie with ads displayed), and automate the data extraction.
Ultimately what I want is a function that takes a company website URL and outputs a list of employees (or defined key people, eg the CEO) to my data repository, where I can manipulate the data at ease, eg:
- filter by any criteria
- keep track of my progress (eg job search or sales prospecting)
Visiting those people's profiles along the way also acts as a "soft touch" (ie they see my profile in their "who viewed your profile" list).
Linkedin gets its ads revenues + my subscription fees, and I use the data only for my own needs, so in line with their terms.
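As a sketch of the target interface (the names here are hypothetical, not the final code):

# Hypothetical target interface - a sketch of the goal, not the final code.
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    title: str
    linkedin_url: str

def employees_for_company(company_website_url: str, keywords: str = "") -> list[Person]:
    """Resolve a company website to its Linkedin page, visit its People tab
    (optionally filtered, eg keywords='ceo'), and push the people found to my
    data repository, returning them for convenience."""
    ...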
Risks
The risk is getting flagged by Linkedin, resulting in account restrictions or even account suspension.
However, considering the following, I hope the risks are limited:
- I'm not automating the visit of pages by clicking on links
- I will stay under the daily visits limit (see below)
- I'm paying for a subscription + letting their ads display (double revenue for Linkedin), and I actually see those ads.
- I'm not using the data for spam, only for targeted outreach (that's the whole point of the automation; the alternative is tedious manual work to target precisely)
- Connects & InMails are sent manually (the manual/high-touch time investment goes here, where it matters, instead of at the data stage)
This project requires some tech-savviness and Python knowledge; it is not for everyone.
Limits
Visits
Rule of thumb: don't view more pages than a professional (or VA) could view manually in a day.
- 300 max profile visits per day? 500?
Connects
No more than 3–5% of your total Linkedin connections (https://support.dux-soup.com/article/45-how-many-connection-requests-to-send) - eg with 1,000 connections, that's 30–50 connection requests.
Structure
- Chrome extension to download pages as HTML files upon visit (ie page load)
- Python script to visit a list of pages (ie profiles), both Company and People
- Python script to extract data from the HTML files
My data repository is Grist.
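The scripts below read from and write to Grist via a small grist_BB helper module of mine. It is not shown here; a minimal sketch of such a wrapper over Grist's REST API could look like this (the server URL, API key and doc IDs are placeholders for my own setup):

# A minimal sketch of a Grist REST API wrapper exposing the same interface as
# my grist_BB module (fetch_table / add_records / update_records).
# SERVER, API_KEY and the doc IDs below are placeholders.
from types import SimpleNamespace
import requests

SERVER = "https://docs.getgrist.com"  # or a self-hosted Grist server
API_KEY = "xxx"  # from the Grist profile settings
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

class GristDoc:
    def __init__(self, doc_id):
        self.base = f"{SERVER}/api/docs/{doc_id}/tables"

    def fetch_table(self, table):
        # Records come back as objects with attribute access (r.id, r.linkedin, ...)
        r = requests.get(f"{self.base}/{table}/records", headers=HEADERS)
        r.raise_for_status()
        return [SimpleNamespace(id=rec["id"], **rec["fields"])
                for rec in r.json()["records"]]

    def add_records(self, table, fields_list):
        # fields_list: list of {column: value} dicts
        payload = {"records": [{"fields": f} for f in fields_list]}
        requests.post(f"{self.base}/{table}/records",
                      headers=HEADERS, json=payload).raise_for_status()

    def update_records(self, table, updates):
        # updates: list of dicts with an 'id' key plus the columns to change
        payload = {"records": [{"id": u["id"],
                                "fields": {k: w for k, w in u.items() if k != "id"}}
                               for u in updates]}
        requests.patch(f"{self.base}/{table}/records",
                       headers=HEADERS, json=payload).raise_for_status()

# Mirrors grist_BB.Startups / grist_BB.VCs / grist_BB.Recruiters
Startups = GristDoc("STARTUPS_DOC_ID")
VCs = GristDoc("VCS_DOC_ID")
Recruiters = GristDoc("RECRUITERS_DOC_ID")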
Chrome extension (download pages as HTML files)
See Building a custom Google Chrome extension
A Chrome extension simply requires a folder containing at least 2 files:
- manifest.json
- background.js
background.js
for Linkedin Company profiles
// Listen for the message sent by the injected runner() below and save the
// page HTML into the Downloads folder.
chrome.runtime.onMessage.addListener(function (request, sender, sendResponse) {
  if (request.message == "download") {
    const dataUrl = 'data:text/html,' + encodeURIComponent(request.html);
    // Build a filesystem-safe filename from the URL path
    let filename = new URL(request.urlo);
    filename = filename.href.replace(filename.origin, "").replace(/[^a-zA-Z0-9]/g, '_');
    chrome.downloads.download({
      'url': dataUrl,
      'filename': "Linkedin/companies/" + filename + ".html",
      'saveAs': false
    });
  }
});

// On every completed page load, inject runner() into Linkedin Company pages.
chrome.tabs.onUpdated.addListener(function (tabId, changeInfo, tab) {
  if (tab.status == 'complete' && tab.url.search("chrome://") < 0) {
    const urlo = tab.url;
    if (urlo.search("linkedin.com/company/") >= 0) {
      chrome.scripting.executeScript({ target: { tabId }, func: runner, args: [urlo] });
    }
  }
  // Runs in the page context: waits 4s for the page to render, then sends the
  // HTML back to the service worker. The "exist" guard ensures the page is
  // captured only once even if the listener fires several times.
  function runner(urlo) {
    if (typeof exist == "undefined") {
      setTimeout(function () {
        chrome.runtime.sendMessage({ "message": "download", "html": document.body.innerHTML, urlo: urlo });
      }, 4000);
      exist = true;
    }
  }
});
for Linkedin People profiles
I use a second Chrome extension for People profiles, with the same background.js as above, changing only these two lines:
- a different save folder in my Downloads folder: 'filename': "Linkedin/people/"+filename+".html",
- a different URL match to trigger the save: if(urlo.search("linkedin.com/in/")>=0){
manifest.json
for Linkedin Company profiles
{
  "name": "Save Linkedin Companies",
  "version": "3.0",
  "manifest_version": 3,
  "description": "automatically saves visited Linkedin Company profiles to a given folder as HTML files",
  "icons": {
    "16": "icon.png",
    "32": "icon.png",
    "48": "icon.png",
    "128": "icon.png"
  },
  "background": {
    "service_worker": "background.js"
  },
  "host_permissions": ["<all_urls>"],
  "permissions": [
    "downloads",
    "scripting",
    "tabs"
  ]
}
for Linkedin People profiles
Same as above, just changing the name. The only difference is in the background.js file triggered by the extension.
TODO - figure out how to have them both in a single extension (not critical at this stage).
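For that TODO, a sketch of how the two extensions could merge into one (untested on my side, so an assumption): keep a single background.js, trigger on both URL patterns, and pick the destination folder from the URL.

// Sketch: one background.js for both Companies and People profiles.
// Only the download handler changes vs. the versions above - the folder is
// chosen from the URL instead of being hard-coded.
chrome.runtime.onMessage.addListener(function (request, sender, sendResponse) {
  if (request.message == "download") {
    const dataUrl = 'data:text/html,' + encodeURIComponent(request.html);
    let filename = new URL(request.urlo);
    filename = filename.href.replace(filename.origin, "").replace(/[^a-zA-Z0-9]/g, '_');
    // Pick the subfolder based on profile type
    const folder = request.urlo.includes("linkedin.com/company/") ? "companies" : "people";
    chrome.downloads.download({
      'url': dataUrl,
      'filename': "Linkedin/" + folder + "/" + filename + ".html",
      'saveAs': false
    });
  }
});
// ...and in the tabs.onUpdated listener, inject runner() on either pattern:
// if (urlo.search("linkedin.com/company/") >= 0 || urlo.search("linkedin.com/in/") >= 0) { ... }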
Python scripts (visit profiles & extract data)
Visitee
####################
# Visit Linkedin profiles
test = False
v = False # verbose mode
### Script-specific imports
import time
import webbrowser
import random
# from random import randint, uniform
import visited
from my_utils import linkedin_handle_from_url, linkedin_url_from_handle
import add_to_Grist_People
import add_to_Grist_Companies
# NB: grist_BB (my Grist API wrapper) and ts_db (DB timestamp) come from a
# shared preamble common to all my scripts, not shown here.
### Global Variables
DB = 'RECRUITERS' # 'STARTUPS' or 'VCs' or 'RECRUITERS'
VISIT = 'PERSON' # 'COMPANY' or 'PERSON'
count_to_visit = random.randint(69, 93)  # randomised daily quota, so the run size isn't a fixed pattern
if test:
count_to_visit = 2
print(f"\ncount_to_visit: {count_to_visit}")
wait_from = 13.5  # min seconds between two visits
wait_to = 27.8  # max seconds between two visits
count_visited = 0
count_visited_today = visited.people_today()  # visited.py is sketched after this script
### Main
print(f"\nLoading data for {DB}...")
if DB == 'STARTUPS':
data_companies = grist_BB.Startups.fetch_table('Master')
data_people = grist_BB.Startups.fetch_table('Linkedins')
elif DB == 'VCs':
data_companies = grist_BB.VCs.fetch_table('Master')
data_people = grist_BB.VCs.fetch_table('Linkedins')
elif DB == 'RECRUITERS':
data_companies = grist_BB.Recruiters.fetch_table('Master')
data_people = grist_BB.Recruiters.fetch_table('Linkedins')
print(f"{len(data_companies)} companies")
print(f"{len(data_people)} people")
print(f"\nStarting Visitee {VISIT} {DB}...")
if VISIT == 'PERSON':
data_companies_to_visit = [
x.id for x in data_companies
# if x.yes == True
# if 'VE_Providers' in x.notes
# if 'US' in x.gristHelper_Display2
if not x.discard
]
data_linkedins_to_visit = {
x.linkedin:x.id for x in data_people
if x.domain in data_companies_to_visit
and x.linkedin not in [None, '']
and not x.downloaded
and not x.discard
# and x.PI == True
}
to_do = len(data_linkedins_to_visit)
if to_do < count_to_visit:
count_to_visit = to_do
print(f"\n\n{to_do} PERSON Linkedin profiles to visit from {DB}.\n\n")
if to_do > 0:
for linkedin,id in data_linkedins_to_visit.items():
if count_visited < count_to_visit:
count_visited += 1
count_visited_today += 1
to_do = to_do - 1
print(f"\n{count_visited}/{count_to_visit} ({to_do} total left / {count_visited_today} visited today) {linkedin}")
update_data = [{
'id': id,
'visited': ts_db,
'downloaded': True,
}]
if not test:
webbrowser.get('chrome').open_new_tab(linkedin)
if DB == 'STARTUPS':
grist_BB.Startups.update_records('Linkedins', update_data)
elif DB == 'VCs':
grist_BB.VCs.update_records('Linkedins', update_data)
elif DB == 'RECRUITERS':
grist_BB.Recruiters.update_records('Linkedins', update_data)
wait = random.uniform(wait_from, wait_to)
print(f"waiting {round(wait, 1)} seconds")
for w in range(1,int(wait) + 1):
print("\r" + str(w), end='')
time.sleep(1)
print()
print(f"\nLEFT TO DO: {to_do}")
### ADD TO GRIST
add_to_Grist_People.process()
### CHECK VISITED
if not test:
visited.people_today()
if VISIT == 'COMPANY':
data_companies_to_visit = {
x.linkedin:x.id for x in data_companies
if x.linkedin not in [None, '']
and not '/showcase/' in x.linkedin
# and x.yes == True
# and not x.ni
and not x.downloaded
# and 'VE_Providers' in x.notes
}
to_do = len(data_companies_to_visit)
if to_do < count_to_visit:
count_to_visit = to_do
print(f"\n\n{to_do} COMPANY Linkedin profiles to visit from {DB}.\n\n")
if to_do > 0:
for linkedin_url,id in data_companies_to_visit.items():
if v:
print(f"\n{linkedin_url=}")
if count_visited < count_to_visit:
count_visited += 1
count_visited_today += 1
to_do = to_do - 1
# print(f"\n{count_visited}/{count_to_visit} ({to_do} total left / {count_visited_today} visited today) {linkedin_url}")
linkedin_handle = linkedin_handle_from_url(linkedin_url)
if v:
print(f"{linkedin_handle=}")
# Visit the company's People tab, filtered to the key role for that DB
if DB in ['STARTUPS']:
linkedin = f"https://www.linkedin.com/company/{linkedin_handle}/people/?keywords=ceo"
elif DB in ['RECRUITERS']:
linkedin = f"https://www.linkedin.com/company/{linkedin_handle}/people/"
elif DB in ['VCs']:
linkedin = f"https://www.linkedin.com/company/{linkedin_handle}/people/?keywords=partner"
print(f"\n{count_visited}/{count_to_visit} ({to_do} total left) {linkedin}")
update_data = [{
'id': id,
'downloaded': True,
}]
if not test:
webbrowser.get('chrome').open_new_tab(linkedin)
if DB == 'STARTUPS':
grist_BB.Startups.update_records('Master', update_data)
elif DB == 'VCs':
grist_BB.VCs.update_records('Master', update_data)
elif DB == 'RECRUITERS':
grist_BB.Recruiters.update_records('Master', update_data)
wait = random.uniform(wait_from, wait_to)
print(f"waiting {round(wait, 1)} seconds")
for w in range(1,int(wait) + 1):
print("\r" + str(w), end='')
time.sleep(1)
print()
### ADD TO GRIST
# First Companies
add_to_Grist_Companies.process()
# Then People extracted from those companies
add_to_Grist_People.process()
### CHECK VISITED
if not test:
visited.people_today()
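The visited module used above is a small helper that counts how many People profiles were already visited today, so that successive runs stay under the daily limit. A minimal sketch of it, counting today's downloaded files (the folder path is from my setup; adjust to where your extension saves):

# Sketch of visited.py - counts People profile pages downloaded today.
# Assumes the Chrome extension saves People profiles to this folder.
import os
from datetime import date

FOLDER = '/Users/xxx/Downloads/Linkedin/profiles'

def people_today():
    count = 0
    for name in os.listdir(FOLDER):
        if not name.endswith('.html'):
            continue
        path = os.path.join(FOLDER, name)
        # A file downloaded today means a profile visited today
        if date.fromtimestamp(os.path.getmtime(path)) == date.today():
            count += 1
    print(f"{count} profiles visited today")
    return count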
Extractee
Companies
####################
# Add scraped data from Linkedin COMPANY pages to Grist
### Script-specific imports
import os
import shutil
import webbrowser
from random import randint
from tqdm import tqdm
from extract_data import company_profile
# import requests
# NB: test, v (verbose), grist_BB, ts_db and pp (pprint) come from my shared
# preamble, as in the visit script above.
### Functions
def fetch_file_paths(folder_path):
global test
file_paths = []
for root, dirs, files in os.walk(folder_path):
if root == folder_path: # remove to also crawl subfolders
for file in files:
file_path = os.path.join(root, file)
file_paths.append(file_path)
if test:
file_paths = file_paths[:10]
return file_paths
def linkedin_handle_from_url(url, v=False):
# People LinkedIn URLs
if 'linkedin.com/in/' in url:
handle = url.split('linkedin.com/in/')[1]
# Company LinkedIn URLs
elif 'linkedin.com/company/' in url:
handle = url.split('linkedin.com/company/')[1]
elif 'linkedin.com/showcase/' in url:
handle = url.split('linkedin.com/showcase/')[1]
# Process handle
if '?' in handle:
handle = handle.split('?')[0]
if '/' in handle:
handle = handle.replace('/', '')
return handle
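# For illustration (hypothetical input):
# linkedin_handle_from_url('https://www.linkedin.com/in/jane-doe/?src=x') -> 'jane-doe'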
def linkedin_link_from_file_path(file_path, v=False):
if v:
print(f"{file_path=}")
# Recreate the LinkedIn URL based on file name
linkedin_link = file_path.replace('/Users/xxx/Downloads/Linkedin/companies/_company_', 'https://www.linkedin.com/company/')
if '_people__keywords_ceo.html' in linkedin_link:
linkedin_link = linkedin_link.replace('_people__keywords_ceo.html', '')
if '_people__keywords_partner.html' in linkedin_link:
linkedin_link = linkedin_link.replace('_people__keywords_partner.html', '')
if '_people__keywords_saas.html' in linkedin_link:
linkedin_link = linkedin_link.replace('_people__keywords_saas.html', '')
if '_people_.html' in linkedin_link:
linkedin_link = linkedin_link.replace('_people_.html', '')
# TODO make it more generic to capture any keyword
linkedin_link = linkedin_link.replace('.html', '')
if linkedin_link.endswith('--'):
linkedin_link = linkedin_link[ : -2 ]
if linkedin_link.endswith('-'):
linkedin_link = linkedin_link[ : -1 ]
# if v:
# print(f'\n{loc} func(linkedin_link_from_file_path) #{ln()}: {linkedin_link=}')
return linkedin_link
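# For illustration, this reverses the extension's filename mangling, eg:
# '/Users/xxx/Downloads/Linkedin/companies/_company_stripe_people__keywords_ceo.html'
#   -> 'https://www.linkedin.com/company/stripe'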
### Main
#### COMPANIES
def process():
count = 0
processed_files = []
error_files = []
error_handles = []
count_total = 0
count_matches = 0
company_categories_found = set()
lc_vcs_linkedins = []
lc_recruiters_linkedins = []
lc_startups_linkedins = []
print(f"\n============ STARTING add_to_Grist_Companies\n")
companies_folder_path = '/Users/xxx/Downloads/Linkedin/companies'
#### Dicts of Grist COMPANY Linkedin handles (per Grist doc), as UID
grist_recruiters_companies = {linkedin_handle_from_url(x.linkedin):int(x.id) for x in grist_BB.Recruiters.fetch_table('Master') if x.linkedin not in [None, '']}
print(f"Fetched {len(grist_recruiters_companies)} Recruiters Company profiles from Grist.")
grist_vcs_companies = {linkedin_handle_from_url(x.linkedin):int(x.id) for x in grist_BB.VCs.fetch_table('Master') if x.linkedin not in [None, '']}
print(f"Fetched {len(grist_vcs_companies)} VCs Company profiles from Grist.")
grist_startups_companies = {linkedin_handle_from_url(x.linkedin):int(x.id) for x in grist_BB.Startups.fetch_table('Master') if x.linkedin not in [None, '']}
print(f"Fetched {len(grist_startups_companies)} Startups Company profiles from Grist.\n")
#### Dicts of Grist PERSON Profiles (per Grist doc), as UID
grist_recruiters_linkedins = {
linkedin_handle_from_url(x.linkedin):int(x.id)
for x in grist_BB.Recruiters.fetch_table('Linkedins')
if x.linkedin not in [None, '']
}
print(f"Fetched {len(grist_recruiters_linkedins)} Recruiters Linkedin profiles from Grist.")
grist_vcs_linkedins = {
linkedin_handle_from_url(x.linkedin):int(x.id)
for x in grist_BB.VCs.fetch_table('Linkedins')
if x.linkedin not in [None, '']
}
print(f"Fetched {len(grist_vcs_linkedins)} VCs Linkedin profiles from Grist.")
grist_startups_linkedins = {
linkedin_handle_from_url(x.linkedin):int(x.id)
for x in grist_BB.Startups.fetch_table('Linkedins')
if x.linkedin not in [None, '']
}
print(f"Fetched {len(grist_startups_linkedins)} Startups Linkedin profiles from Grist.\n")
#### Dict of Scraped profiles
# def process_companies_files(): # TODO make function so as to be called again later with errors
file_paths = fetch_file_paths(companies_folder_path)
dict_scraped_vcs = {}
dict_scraped_recruiters = {}
dict_scraped_startups = {}
for file_path in tqdm(file_paths):
count_total += 1
# print(f"{file_path=}")
if file_path.endswith((".html")):
if '(' not in file_path:
if '_people_' in file_path:
count_matches += 1
if v:
print(f"\n\n#######################\n{count_matches}\n{file_path=}")
linkedin_link = linkedin_link_from_file_path(file_path, v=v)
linkedin_handle = linkedin_handle_from_url(linkedin_link, v=v)
try:
# Scrape the LinkedIn profile with extract_data.py
company_profile_data = company_profile(file_path, v=v)
# Categorisation
category = company_profile_data['category']
company_categories_found.add(category) # for overview of all categories found
if 'Recruiting' in category:
classify_as = 'Recruiter'
elif 'Venture Capital' in category:
classify_as = 'VC'
else:
classify_as = 'Startup'
# Add to dicts per Grist Doc with Linkedin handle as key
if classify_as == 'Recruiter':
dict_scraped_recruiters[linkedin_handle] = company_profile_data
elif classify_as == 'VC':
dict_scraped_vcs[linkedin_handle] = company_profile_data
elif classify_as == 'Startup':
dict_scraped_startups[linkedin_handle] = company_profile_data
processed_files.append(file_path)
except Exception as e:
print(f"Error with handle {linkedin_handle} ({e}): removing file {file_path}")
error_handles.append(linkedin_handle)
os.remove(file_path)
else:
os.remove(file_path)  # duplicate download (contains '('), discard
else:
os.remove(file_path)  # not an HTML file, discard
#### Process to Grist
print(f"\n\nProcessing scraped data:\n{len(dict_scraped_recruiters)} Recruiters\n{len(dict_scraped_vcs)} VCs\n{len(dict_scraped_startups)} Startups")
# processed_startups = []
# print()
# pp.pprint(dict_scraped_startups)
# print()
### STARTUPS
for linkedin_company_handle, data in dict_scraped_startups.items():
if v:
print(f"\n{linkedin_company_handle=}")
print(f"\ndata={pp.pprint(data)}")
file_path = data['file_path']
# PEOPLE DATA
people = data['people']
for l,p in people.items():
if v:
print(f"\n{l=}")
print(f"\n{p=}")
linkedin_handle = linkedin_handle_from_url(p['linkedin_link'])
if '%' not in linkedin_handle:
connection = p['connection']
new_linkedin_link = f"https://www.linkedin.com/in/{linkedin_handle}"
name = p['name']
title = p['title']
person_data = {
'connection': connection,
'linkedin': new_linkedin_link,
'full_name': name,
'headline': title,
'src': f"https://www.linkedin.com/company/{linkedin_company_handle}",
'created': ts_db,
}
## PREPARE TO ADD TO GRIST
if v:
print(f"\n{linkedin_handle=}\n")
### Startups
if linkedin_company_handle in grist_startups_companies:
if linkedin_handle not in grist_startups_linkedins:
# Get the Foreign Key for the Master record to associate People with
fk = grist_startups_companies[linkedin_company_handle] # int x.id
# Update person_data object with Foreign Key
person_data['domain'] = fk
# Append to Startups list to create in Grist
lc_startups_linkedins.append(person_data)
print(f"Added {name} / {new_linkedin_link} to Startups")
# COMPANY DATA
# TODO goal: update Master record with data from Company Linkedin profile
processed_files.append(file_path)
### RECRUITERS
for linkedin_company_handle, data in dict_scraped_recruiters.items():
if v:
print(f"\n{linkedin_company_handle=}")
print(f"\ndata={pp.pprint(data)}")
file_path = data['file_path']
# PEOPLE DATA
people = data['people']
for l,p in people.items():
if v:
print(f"\n{l=}")
print(f"\n{p=}")
linkedin_handle = linkedin_handle_from_url(p['linkedin_link'])
if '%' not in linkedin_handle:
connection = p['connection']
new_linkedin_link = f"https://www.linkedin.com/in/{linkedin_handle}"
name = p['name']
title = p['title']
person_data = {
'connection': connection,
'linkedin': new_linkedin_link,
'full_name': name,
'headline': title,
'src': f"https://www.linkedin.com/company/{linkedin_company_handle}",
'created': ts_db,
}
## PREPARE TO ADD TO GRIST
if v:
print(f"\n{linkedin_handle=}\n")
if linkedin_company_handle in grist_recruiters_companies:
if linkedin_handle not in grist_recruiters_linkedins:
fk = grist_recruiters_companies[linkedin_company_handle] # int x.id
person_data['domain'] = fk
lc_recruiters_linkedins.append(person_data)
print(f"Added {name} / {new_linkedin_link} to recruiters")
# COMPANY DATA
# TODO goal: update Master record with data from Company Linkedin profile
processed_files.append(file_path)
### VCs
for linkedin_company_handle, data in dict_scraped_vcs.items():
if v:
print(f"\n{linkedin_company_handle=}")
print(f"\ndata={pp.pprint(data)}")
file_path = data['file_path']
# PEOPLE DATA
people = data['people']
for l,p in people.items():
if v:
print(f"\n{l=}")
print(f"\n{p=}")
linkedin_handle = linkedin_handle_from_url(p['linkedin_link'])
if '%' not in linkedin_handle:
connection = p['connection']
new_linkedin_link = f"https://www.linkedin.com/in/{linkedin_handle}"
name = p['name']
title = p['title']
person_data = {
'connection': connection,
'linkedin': new_linkedin_link,
'full_name': name,
'headline': title,
'src': f"https://www.linkedin.com/company/{linkedin_company_handle}",
'created': ts_db,
}
## PREPARE TO ADD TO GRIST
if v:
print(f"\n{linkedin_handle=}\n")
if linkedin_company_handle in grist_vcs_companies:
if linkedin_handle not in grist_vcs_linkedins:
fk = grist_vcs_companies[linkedin_company_handle] # int x.id
person_data['domain'] = fk
lc_vcs_linkedins.append(person_data)
print(f"Added {name} / {new_linkedin_link} to vcs")
# COMPANY DATA
# TODO goal: update Master record with data from Company Linkedin profile
processed_files.append(file_path)
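# NB: the three loops above (Startups / Recruiters / VCs) are identical except
# for their target doc; they could be factored into one helper, eg (a sketch,
# not yet wired in - names are mine):
# def collect_people(dict_scraped, grist_companies, grist_linkedins, out_list, label):
#     for company_handle, data in dict_scraped.items():
#         for l, p in data['people'].items():
#             handle = linkedin_handle_from_url(p['linkedin_link'])
#             if '%' in handle:
#                 continue  # skip URL-encoded (non-canonical) handles
#             person_data = {
#                 'connection': p['connection'],
#                 'linkedin': f"https://www.linkedin.com/in/{handle}",
#                 'full_name': p['name'],
#                 'headline': p['title'],
#                 'src': f"https://www.linkedin.com/company/{company_handle}",
#                 'created': ts_db,
#             }
#             if company_handle in grist_companies and handle not in grist_linkedins:
#                 person_data['domain'] = grist_companies[company_handle]  # FK to Master
#                 out_list.append(person_data)
#                 print(f"Added {p['name']} to {label}")
#         processed_files.append(data['file_path'])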
# ADD TO GRIST
if test:
print(f"\n\nTEST:")
print(f"{len(lc_startups_linkedins)} Startups Linkedins to Grist:")
print(f"{len(lc_vcs_linkedins)} VCs Linkedins to Grist:")
print(f"{len(lc_recruiters_linkedins)} Recruiters Linkedins to Grist:")
else:
if len(lc_startups_linkedins) > 0:
grist_BB.Startups.add_records('Linkedins', lc_startups_linkedins)
print(f"\n\nADDED {len(lc_startups_linkedins)} Startups Linkedins profiles to Grist")
if len(lc_vcs_linkedins) > 0:
grist_BB.VCs.add_records('Linkedins', lc_vcs_linkedins)
print(f"ADDED {len(lc_vcs_linkedins)} VCs Linkedins profiles to Grist")
if len(lc_recruiters_linkedins) > 0:
grist_BB.Recruiters.add_records('Linkedins', lc_recruiters_linkedins)
print(f"ADDED {len(lc_recruiters_linkedins)} Recruiters Linkedins profiles to Grist")
# MOVE PROCESSED FILES
print()
if not test:
for pf in sorted(set(processed_files)):
file_name = os.path.basename(pf)
# print(f"{pf=}")
# print(f"{file_name=}")
if os.path.exists(f'/Users/xxx/Downloads/Linkedin/companies/processed/{file_name}'):
os.remove(f'/Users/xxx/Downloads/Linkedin/companies/processed/{file_name}')
shutil.move(pf, '/Users/xxx/Downloads/Linkedin/companies/processed')
print(f"\n\n============ Moved {len(processed_files)} scraped files to processed folder.\n")
# CATEGORIES OF COMPANIES PROCESSED (for classification tweaking/troubleshooting)
print(f"\nFYI - COMPANY CATEGORIES FOUND:")
pp.pprint(company_categories_found)
print()
if len(error_handles) > 0:
# print(f"\n{len(error_handles)} ERROR WITH FILES - opening them now on People section...")
print(f"\n{len(error_handles)} ERROR WITH FILES:")
for handle in error_handles:
link_to_open = f"https://www.linkedin.com/company/{handle}/people/"
print(f"{link_to_open}")
# # 230113 Stopped visiting automatically as most are dead links
# if handle.strip() not in ['unavailable','404']:
# print(f"Opening {link_to_open}")
# webbrowser.get('chrome').open_new_tab(link_to_open)
# wait = randint(12, 29)
# print(f"waiting {wait} seconds")
# for w in range(1,wait + 1):
# # print(w, end=" ", flush=True)
# print("\r" + str(w), end='')
# time.sleep(1)
# print()
print(f"\n\n{count_total=}")
print(f"{count_matches=}")
# print(f"{count=}")
# TODO manage errors by visiting standard Company link & doing a second pass (make function first)
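extract_data.py (which provides company_profile and linkedin_profile) is not shown here. As a rough sketch of the shape company_profile returns - all the CSS selectors below are placeholders I made up, since Linkedin's markup changes regularly and must be found by inspecting a saved page:

# Sketch of extract_data.company_profile - parses a saved Company People page.
# Every selector here is an invented placeholder, not Linkedin's real markup.
from bs4 import BeautifulSoup

def _text(node):
    return node.get_text(strip=True) if node else ''

def company_profile(file_path, v=False):
    with open(file_path, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    people = {}
    for card in soup.select('.profile-card'):  # placeholder selector
        link = card.select_one('a[href*="/in/"]')
        if not link:
            continue  # card without a profile link (eg anonymous "Linkedin Member")
        people[link['href']] = {
            'linkedin_link': link['href'],
            'name': _text(card.select_one('.card-name')),          # placeholder
            'title': _text(card.select_one('.card-title')),        # placeholder
            'connection': _text(card.select_one('.card-degree')),  # placeholder
        }
    return {
        'file_path': file_path,
        'category': _text(soup.select_one('.company-industry')),  # placeholder
        'people': people,
    }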
People
### Script-specific imports
import os
import shutil
import requests
from tqdm import tqdm
from extract_data import linkedin_profile, company_profile
# NB: test, v, grist_BB and the global count come from my shared preamble,
# as in the scripts above.
### Global Variables
# RUN = 'PEOPLE' # 'PEOPLE' or 'COMPANIES' or 'ALL'
folder_path = '/Users/xxx/Downloads/Linkedin/profiles'
missing_experiences = []
processed_files = []
### Functions
def fetch_file_paths(folder_path):
global test
file_paths = []
for root, dirs, files in os.walk(folder_path):
if root == folder_path: # remove to also crawl subfolders
for file in files:
file_path = os.path.join(root, file)
file_paths.append(file_path)
if test:
file_paths = file_paths[:10]
return file_paths
def linkedin_handle_from_url(url):
# People LinkedIn URLs
if 'linkedin.com/in/' in url:
handle = url.split('linkedin.com/in/')[1]
# Company LinkedIn URLs
elif 'linkedin.com/company/' in url:
handle = url.split('linkedin.com/company/')[1]
elif 'linkedin.com/showcase/' in url:
handle = url.split('linkedin.com/showcase/')[1]
# Process handle
if '?' in handle:
handle = handle.split('?')[0]
if '/' in handle:
handle = handle.replace('/', '')
return handle
### Main
#### PEOPLE
def process():
global count
print(f"\n============ STARTING add_to_Grist_People\n")
# Dicts of Grist profiles (per Grist doc)
grist_recruiters_linkedins = {}
for x in grist_BB.Recruiters.fetch_table('Linkedins'):
if x.linkedin in [None, '']:
continue
linkedin = x.linkedin.rstrip('/')  # normalise away any trailing slash
grist_recruiters_linkedins[linkedin] = int(x.id)
print(f"Fetched {len(grist_recruiters_linkedins)} Recruiters Linkedin profiles from Grist.")
grist_vcs_linkedins = {}
for x in grist_BB.VCs.fetch_table('Linkedins'):
if x.linkedin in [None, '']:
continue
linkedin = x.linkedin.rstrip('/')  # normalise away any trailing slash
grist_vcs_linkedins[linkedin] = int(x.id)
print(f"Fetched {len(grist_vcs_linkedins)} VCs Linkedin profiles from Grist.")
grist_startups_linkedins = {}
for x in grist_BB.Startups.fetch_table('Linkedins'):
if x.linkedin in [None, '']:
continue
linkedin = x.linkedin.rstrip('/')  # normalise away any trailing slash
grist_startups_linkedins[linkedin] = int(x.id)
print(f"Fetched {len(grist_startups_linkedins)} Startups Linkedin profiles from Grist.\n")
# Dict of Scraped profiles
file_paths = fetch_file_paths(folder_path)
dict_scraped = {}
for file_path in tqdm(file_paths):
if file_path.endswith((".html")):
if '(' not in file_path:
count += 1
if v:
print(f"\n\n#######################\n{count}\n{file_path=}")
# Recreate the LinkedIn URL based on file name
linkedin_link = file_path.replace('/Users/xxx/Downloads/Linkedin/profiles/_in_', 'https://www.linkedin.com/in/')
if 'originalSubdomain' in linkedin_link:
linkedin_link = linkedin_link.split('originalSubdomain')[0]
if '_' in linkedin_link:
linkedin_link = linkedin_link.replace('_', '-')
linkedin_link = linkedin_link.replace('-.html', '')
if linkedin_link.endswith('--'):
linkedin_link = linkedin_link[ : -2 ]
if linkedin_link.endswith('-'):
linkedin_link = linkedin_link[ : -1 ]
if v:
print(f'\n{linkedin_link=}')
# Scrape the LinkedIn profile with extract_data.py
profile_data = linkedin_profile(linkedin_link,file_path, v=v)
dict_scraped[linkedin_link] = profile_data
# Keep track of profiles with no Experience identified because no Company logo on scraped page
if profile_data['count_experiences'] == 0:
missing_experiences.append(linkedin_link)
# os.remove(file_path)
else:
os.remove(file_path)  # duplicate download (contains '('), discard
#### UPDATE GRIST
## Recruiters
lu_recruiters = []
for linkedin_link, id in grist_recruiters_linkedins.items():
if linkedin_link in dict_scraped:
lu_recruiters.append(
{ 'id': id,
'full_name': dict_scraped[linkedin_link]['name'],
'connection': dict_scraped[linkedin_link]['connection'],
'headline': dict_scraped[linkedin_link]['headline'],
'location': dict_scraped[linkedin_link]['location'],
'connections': dict_scraped[linkedin_link]['connections'],
'status': dict_scraped[linkedin_link]['connection_status'],
'comp1_linkedin': dict_scraped[linkedin_link]['company_linkedin1'],
'comp1_name': dict_scraped[linkedin_link]['company_name1'],
'comp1_title': dict_scraped[linkedin_link]['title1'],
'comp1_dates': dict_scraped[linkedin_link]['company_dates1'],
'comp2_linkedin': dict_scraped[linkedin_link]['company_linkedin2'],
'comp2_name': dict_scraped[linkedin_link]['company_name2'],
'comp2_title': dict_scraped[linkedin_link]['title2'],
'comp2_dates': dict_scraped[linkedin_link]['company_dates2'],
'comp3_linkedin': dict_scraped[linkedin_link]['company_linkedin3'],
'comp3_name': dict_scraped[linkedin_link]['company_name3'],
'comp3_title': dict_scraped[linkedin_link]['title3'],
'comp3_dates': dict_scraped[linkedin_link]['company_dates3'],
}
)
processed_files.append(dict_scraped[linkedin_link]['file_path'])
grist_BB.Recruiters.update_records('Linkedins', lu_recruiters)
print(f"\nUPDATED {len(lu_recruiters)} RECRUITERS LINKEDINS in GRIST.")
## VCs
lu_vcs = []
for linkedin_link, id in grist_vcs_linkedins.items():
if linkedin_link in dict_scraped:
lu_vcs.append(
{ 'id': id,
'full_name': dict_scraped[linkedin_link]['name'],
'connection': dict_scraped[linkedin_link]['connection'],
'headline': dict_scraped[linkedin_link]['headline'],
'location': dict_scraped[linkedin_link]['location'],
'connections': dict_scraped[linkedin_link]['connections'],
'status': dict_scraped[linkedin_link]['connection_status'],
'comp1_linkedin': dict_scraped[linkedin_link]['company_linkedin1'],
'comp1_name': dict_scraped[linkedin_link]['company_name1'],
'comp1_title': dict_scraped[linkedin_link]['title1'],
'comp1_dates': dict_scraped[linkedin_link]['company_dates1'],
'comp2_linkedin': dict_scraped[linkedin_link]['company_linkedin2'],
'comp2_name': dict_scraped[linkedin_link]['company_name2'],
'comp2_title': dict_scraped[linkedin_link]['title2'],
'comp2_dates': dict_scraped[linkedin_link]['company_dates2'],
'comp3_linkedin': dict_scraped[linkedin_link]['company_linkedin3'],
'comp3_name': dict_scraped[linkedin_link]['company_name3'],
'comp3_title': dict_scraped[linkedin_link]['title3'],
'comp3_dates': dict_scraped[linkedin_link]['company_dates3'],
}
)
processed_files.append(dict_scraped[linkedin_link]['file_path'])
grist_BB.VCs.update_records('Linkedins', lu_vcs)
print(f"\nUPDATED {len(lu_vcs)} VCs LINKEDINS in GRIST.")
## Startups
lu_startups = []
for linkedin_link, id in grist_startups_linkedins.items():
if linkedin_link in dict_scraped:
lu_startups.append(
{ 'id': id,
'full_name': dict_scraped[linkedin_link]['name'],
'connection': dict_scraped[linkedin_link]['connection'],
'headline': dict_scraped[linkedin_link]['headline'],
'location': dict_scraped[linkedin_link]['location'],
'connections': dict_scraped[linkedin_link]['connections'],
'status': dict_scraped[linkedin_link]['connection_status'],
'comp1_linkedin': dict_scraped[linkedin_link]['company_linkedin1'],
'comp1_name': dict_scraped[linkedin_link]['company_name1'],
'comp1_title': dict_scraped[linkedin_link]['title1'],
'comp1_dates': dict_scraped[linkedin_link]['company_dates1'],
'comp2_linkedin': dict_scraped[linkedin_link]['company_linkedin2'],
'comp2_name': dict_scraped[linkedin_link]['company_name2'],
'comp2_title': dict_scraped[linkedin_link]['title2'],
'comp2_dates': dict_scraped[linkedin_link]['company_dates2'],
'comp3_linkedin': dict_scraped[linkedin_link]['company_linkedin3'],
'comp3_name': dict_scraped[linkedin_link]['company_name3'],
'comp3_title': dict_scraped[linkedin_link]['title3'],
'comp3_dates': dict_scraped[linkedin_link]['company_dates3'],
}
)
processed_files.append(dict_scraped[linkedin_link]['file_path'])
grist_BB.Startups.update_records('Linkedins', lu_startups)
print(f"\nUPDATED {len(lu_startups)} STARTUP LINKEDINS VCs in GRIST.")
if len(missing_experiences) > 0:
print()
print(f"\n\nDELETED following files with Missing Experiences:")
for m in missing_experiences:
print(m)
if len(processed_files) > 0:
for pf in sorted(set(processed_files)):
file_name = os.path.basename(pf)
if os.path.exists(f'/Users/xxx/Downloads/Linkedin/profiles/processed/{file_name}'):
os.remove(f'/Users/xxx/Downloads/Linkedin/profiles/processed/{file_name}')
shutil.move(pf, '/Users/xxx/Downloads/Linkedin/profiles/processed')
print(f"\n\nMOVED {len(processed_files)} scraped files to processed folder.\n")
# TODO:
# - get list of LinkedIn URLs for "People also viewed" section
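For that last TODO, a possible approach (a sketch; the assumption that the section's links sit in an aside element is mine and needs checking against a saved page):

# Sketch for the "People also viewed" TODO - collect /in/ links from the
# sidebar of a saved profile page. The 'aside' location is an assumption.
from bs4 import BeautifulSoup

def also_viewed_links(file_path):
    with open(file_path, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    container = soup.find('aside') or soup  # fall back to the whole page
    links = set()
    for a in container.select('a[href*="linkedin.com/in/"]'):
        links.add(a['href'].split('?')[0])  # strip tracking params
    return sorted(links)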
Keyboard shortcuts
16 Mar 2025
- Option + C: trigger Chrome extension for automated Linkedin connect.
- Cmd + Space (Alfred) > "dl": Discard Linkedin.
TODO
Get Also Viewed
15 Jan 2023
DONE for already processed files
- add logic to run it upon single-file processing
Get Company data from Company pages
DONE one-off script for all processed files
- add logic to run it upon single-file processing
- update records in Grist where needed
Make Grist_Companies a function
- returns the number of employees found
- triggered by each company profile visit
Process each file after visit
- defines the wait time before the next visit based on the number of employees processed (with the function above)
Resources
