Adding a Custom Job Board Scraper
Peregrine supports pluggable custom job board scrapers. Standard boards are handled by the JobSpy library. Custom scrapers cover boards that JobSpy cannot: those with non-standard APIs, paywalls, or server-side-rendered (SSR) pages.
This guide walks through adding a new scraper from scratch.
Step 1: Create the scraper module
Create scripts/custom_boards/myboard.py. Every custom scraper must implement one function:
# scripts/custom_boards/myboard.py
from datetime import datetime

# Imported at module level so tests can patch
# scripts.custom_boards.myboard.requests.get
import requests


def scrape(profile: dict, db_path: str) -> list[dict]:
    """
    Scrape job listings from MyBoard for the given search profile.

    Args:
        profile: The active search profile dict from search_profiles.yaml.
            Keys include: titles (list), locations (list),
            hours_old (int), results_per_board (int).
        db_path: Absolute path to staging.db. Use this if you need to
            check for existing URLs before returning.

    Returns:
        List of job dicts. Each dict must contain at minimum:
            title (str): job title
            company (str): company name
            url (str): canonical job URL (used as unique key)
            source (str): board identifier, e.g. "myboard"
            location (str): "Remote" or "City, State"
            is_remote (bool): True if remote
            salary (str): salary string or "" if unknown
            description (str): full job description text or "" if unavailable
            date_found (str): ISO 8601 datetime string, e.g. "2026-02-25T12:00:00"
    """
    jobs = []
    # One request per (title, location) pair in the profile.
    for title in profile.get("titles", []):
        for location in profile.get("locations", []):
            results = _fetch_from_myboard(title, location, profile)
            jobs.extend(results)
    return jobs


def _fetch_from_myboard(title: str, location: str, profile: dict) -> list[dict]:
    """Internal helper: call the board's API and transform results."""
    params = {
        "q": title,
        "l": location,
        "limit": profile.get("results_per_board", 50),
    }
    try:
        resp = requests.get(
            "https://api.myboard.com/jobs",
            params=params,
            timeout=15,
        )
        resp.raise_for_status()
        data = resp.json()
    except Exception as e:
        # Broad catch so a single failed request never aborts the whole run.
        print(f"[myboard] fetch error: {e}")
        return []

    jobs = []
    for item in data.get("results", []):
        jobs.append({
            "title": item.get("title", ""),
            "company": item.get("company", ""),
            "url": item.get("url", ""),
            "source": "myboard",
            "location": item.get("location", ""),
            "is_remote": "remote" in item.get("location", "").lower(),
            "salary": item.get("salary", ""),
            "description": item.get("description", ""),
            # Naive UTC timestamp, matching the docstring example.
            "date_found": datetime.utcnow().isoformat(),
        })
    return jobs
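The docstring mentions that db_path can be used to check for existing URLs. That check is optional, but it might be worthwhile when each listing needs an extra request for its full description. The sketch below shows one way it could look. It assumes a table named jobs with a url column, which is a guess and should be checked against the real schema of staging.db; known_urls and drop_known are hypothetical helper names.

```python
import sqlite3
from pathlib import Path


def known_urls(db_path: str) -> set[str]:
    """Return URLs already stored in the staging database.

    ASSUMPTION: a table named `jobs` with a `url` column. Verify this
    against the actual schema of staging.db before using it.
    """
    if not Path(db_path).exists():
        # sqlite3.connect would create an empty file; avoid that side effect.
        return set()
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT url FROM jobs").fetchall()
    except sqlite3.Error:
        # Missing table or unreadable file: treat as "nothing known yet".
        return set()
    finally:
        conn.close()
    return {row[0] for row in rows if row[0]}


def drop_known(jobs: list[dict], db_path: str) -> list[dict]:
    """Filter out jobs whose URL is already in the database."""
    seen = known_urls(db_path)
    return [job for job in jobs if job.get("url") not in seen]
```

Inside scrape, a call such as drop_known(jobs, db_path) before any per-listing detail fetch would skip work for listings that are already stored.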
Required fields
| Field | Type | Notes |
|---|---|---|
| title | str | Job title |
| company | str | Company name |
| url | str | Unique key; must be stable and canonical |
| source | str | Short board identifier, e.g. "myboard" |
| location | str | "Remote" or "City, ST" |
| is_remote | bool | True if remote |
| salary | str | Salary string or "" |
| description | str | Full description text or "" |
| date_found | str | ISO 8601 UTC datetime |
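A small validator can catch a missing or mistyped field before the jobs reach discover.py. The following is a minimal sketch built from the table above; REQUIRED_FIELDS and validate_job are illustrative names rather than part of Peregrine, and treating an empty url as an error is an assumption based on url being the unique key.

```python
from datetime import datetime

# Field names and types taken from the required-fields table.
# REQUIRED_FIELDS and validate_job are hypothetical, not Peregrine APIs.
REQUIRED_FIELDS = {
    "title": str,
    "company": str,
    "url": str,
    "source": str,
    "location": str,
    "is_remote": bool,
    "salary": str,
    "description": str,
    "date_found": str,
}


def validate_job(job: dict) -> list[str]:
    """Return a list of problems found in a job dict (empty if none)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in job:
            problems.append(f"missing field: {field}")
        elif not isinstance(job[field], expected):
            problems.append(
                f"{field} should be {expected.__name__}, "
                f"got {type(job[field]).__name__}"
            )
    # Assumption: url is the unique key, so an empty string is rejected.
    if isinstance(job.get("url"), str) and not job["url"].strip():
        problems.append("url is empty")
    # date_found must parse as an ISO 8601 datetime.
    if isinstance(job.get("date_found"), str):
        try:
            datetime.fromisoformat(job["date_found"])
        except ValueError:
            problems.append("date_found is not ISO 8601")
    return problems
```

In a test, asserting validate_job(job) == [] for every returned job gives a more specific failure message than a series of key-presence checks.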
Deduplication
discover.py deduplicates by url before inserting into the database. If a job with the same URL already exists, it is silently skipped. You do not need to handle deduplication inside your scraper.
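Because deduplication keys on url, the same listing reached through two slightly different links would be stored twice. A scraper can reduce that risk by normalizing URLs before returning them. The sketch below is one possible approach using only the standard library; canonical_url is a hypothetical helper, and the list of tracking parameters is an assumption that should be adjusted per board.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; adjust for the board you are scraping.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"ref", "trk"}


def canonical_url(url: str) -> str:
    """Normalize a job URL so the same listing always yields the same string."""
    parts = urlsplit(url.strip())
    # Drop tracking parameters and sort the rest for a stable order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    )
    # Remove trailing slashes, but keep "/" for a bare domain.
    path = parts.path.rstrip("/") or "/"
    # The fragment is discarded: it never identifies a different listing.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )
```

Applying it where the job dict is built, for example "url": canonical_url(item.get("url", "")), keeps the key stable across runs.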
Rate limiting
Be a good citizen:
- Add a time.sleep(0.5) between paginated requests
- Respect Retry-After headers
- Do not scrape faster than a human browsing the site
- If the site provides an official API, prefer that over scraping HTML
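The first two points can be combined in a single pagination loop. This is a minimal sketch under several assumptions: the board paginates with a page query parameter, returns items under a results key, and signals throttling with HTTP 429. fetch_pages and retry_after_seconds are hypothetical names, and the get argument is any callable with the same signature as requests.get, injected so the loop can be exercised without network access.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=0.5):
    """Convert a Retry-After header (seconds or HTTP-date) to a delay in seconds."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max((when - datetime.now(timezone.utc)).total_seconds(), 0.0)


def fetch_pages(get, url, params, max_pages=5, pause=0.5, sleep=time.sleep):
    """Fetch up to max_pages pages, pausing between requests.

    ASSUMPTIONS: a `page` query parameter and a `results` key in the JSON.
    """
    items, page, retried = [], 1, False
    while page <= max_pages:
        resp = get(url, params={**params, "page": page}, timeout=15)
        if resp.status_code == 429 and not retried:
            # The server asked us to back off: wait as told, retry once.
            sleep(retry_after_seconds(resp.headers.get("Retry-After"), pause))
            retried = True
            continue
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break  # an empty page means there is nothing left
        items.extend(batch)
        page, retried = page + 1, False
        if page <= max_pages:
            sleep(pause)  # fixed pause between paginated requests
    return items
```

With the requests library, a call might look like fetch_pages(requests.get, "https://api.myboard.com/jobs", {"q": title}). A second consecutive 429 is not retried and surfaces through raise_for_status.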
Credentials
If your scraper requires API keys or credentials:
- Create config/myboard.yaml.example as a template
- Create config/myboard.yaml (gitignored) for live credentials
- Read it in your scraper inside a with block so the file handle is closed: with open("config/myboard.yaml") as f: creds = yaml.safe_load(f)
- Document the credential setup in comments at the top of your module
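The steps above could be wrapped in a small loader that fails with a clear message when the file or a key is missing. This is a sketch, not Peregrine's own code: load_credentials is a hypothetical helper, api_key is a placeholder key name, and it assumes PyYAML is installed.

```python
from pathlib import Path

import yaml  # PyYAML


def load_credentials(path="config/myboard.yaml", required=("api_key",)):
    """Load credentials from a YAML file and check the expected keys exist.

    "api_key" is a placeholder; use whatever keys your board needs.
    """
    config_path = Path(path)
    if not config_path.exists():
        raise FileNotFoundError(
            f"{config_path} not found; copy {config_path}.example and fill it in"
        )
    # The with block closes the file handle even if parsing fails.
    with config_path.open(encoding="utf-8") as fh:
        creds = yaml.safe_load(fh) or {}
    missing = [key for key in required if not creds.get(key)]
    if missing:
        raise KeyError(f"{config_path} is missing: {', '.join(missing)}")
    return creds
```

Calling it once at the start of scrape, rather than at import time, keeps the module importable in tests where no credentials file exists.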
Step 2: Register the scraper
Open scripts/discover.py and add your scraper to the CUSTOM_SCRAPERS dict:
from scripts.custom_boards import adzuna, theladders, craigslist, myboard

CUSTOM_SCRAPERS = {
    "adzuna": adzuna.scrape,
    "theladders": theladders.scrape,
    "craigslist": craigslist.scrape,
    "myboard": myboard.scrape,  # add this line
}
Step 3: Activate in a search profile
Open config/search_profiles.yaml and add myboard to custom_boards in any profile:
profiles:
  - name: cs_leadership
    boards:
      - linkedin
      - indeed
    custom_boards:
      - adzuna
      - myboard  # add this line
    titles:
      - Customer Success Manager
    locations:
      - Remote
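To see how the CUSTOM_SCRAPERS registry from Step 2 and the custom_boards list fit together, here is a minimal sketch of a dispatch loop. It is an assumption about how discover.py might consume the two, not a copy of its code; run_custom_boards and fake_myboard are hypothetical names.

```python
# Illustrative only: the real loop in scripts/discover.py may differ.
def fake_myboard(profile: dict, db_path: str) -> list[dict]:
    """Stand-in scraper so this sketch runs without network access."""
    return [
        {"title": title, "url": f"https://myboard.com/jobs/{i}", "source": "myboard"}
        for i, title in enumerate(profile.get("titles", []))
    ]


CUSTOM_SCRAPERS = {"myboard": fake_myboard}


def run_custom_boards(profile: dict, db_path: str, registry: dict) -> list[dict]:
    """Call every scraper named in profile["custom_boards"]."""
    jobs = []
    for name in profile.get("custom_boards", []):
        scraper = registry.get(name)
        if scraper is None:
            # A name in the YAML with no registry entry is most likely
            # a typo or a skipped Step 2.
            print(f"[discover] unknown custom board: {name}")
            continue
        jobs.extend(scraper(profile, db_path))
    return jobs
```

The point of the sketch is that the string in custom_boards must match the registry key exactly; a mismatch means the scraper never runs.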
Step 4: Write a test
Create tests/test_myboard.py. Mock the HTTP call to avoid hitting the live API during tests:
# tests/test_myboard.py
from unittest.mock import patch

from scripts.custom_boards.myboard import scrape

MOCK_RESPONSE = {
    "results": [
        {
            "title": "Customer Success Manager",
            "company": "Acme Corp",
            "url": "https://myboard.com/jobs/12345",
            "location": "Remote",
            "salary": "$80,000 - $100,000",
            "description": "We are looking for a CSM...",
        }
    ]
}


def test_scrape_returns_correct_shape():
    profile = {
        "titles": ["Customer Success Manager"],
        "locations": ["Remote"],
        "results_per_board": 10,
        "hours_old": 240,
    }
    with patch("scripts.custom_boards.myboard.requests.get") as mock_get:
        mock_get.return_value.ok = True
        mock_get.return_value.raise_for_status = lambda: None
        mock_get.return_value.json.return_value = MOCK_RESPONSE
        jobs = scrape(profile, db_path="nonexistent.db")

    assert len(jobs) == 1
    job = jobs[0]
    # Required fields
    assert "title" in job
    assert "company" in job
    assert "url" in job
    assert "source" in job
    assert "location" in job
    assert "is_remote" in job
    assert "salary" in job
    assert "description" in job
    assert "date_found" in job
    assert job["source"] == "myboard"
    assert job["title"] == "Customer Success Manager"
    assert job["url"] == "https://myboard.com/jobs/12345"


def test_scrape_handles_http_error_gracefully():
    profile = {
        "titles": ["Customer Success Manager"],
        "locations": ["Remote"],
        "results_per_board": 10,
        "hours_old": 240,
    }
    with patch("scripts.custom_boards.myboard.requests.get") as mock_get:
        mock_get.side_effect = Exception("Connection refused")
        jobs = scrape(profile, db_path="nonexistent.db")

    assert jobs == []
Existing Scrapers as Reference
| Scraper | Notes |
|---|---|
| scripts/custom_boards/adzuna.py | REST API with app_id + app_key authentication |
| scripts/custom_boards/theladders.py | SSR scraper using curl_cffi to parse __NEXT_DATA__ JSON embedded in the page |
| scripts/custom_boards/craigslist.py | RSS feed scraper |