Ship a scraper that won’t crumble under rate limits, bans, or legal missteps. This starter kit shows the fastest safe path from idea to production.
Why this kit
What you’ll learn: Why reusable n8n workflows beat one-off scripts and how event-first design, community nodes, and Postgres/Supabase improve stability
Modern scraping fails when treated as a one-off script. n8n lets you compose durable pieces instead
- Event-first ingestion beats brittle cron polling
- Community nodes add battle-tested power without glue code
- Postgres/Supabase enables upserts, dedupe, and audit trails
Build once, reuse across teams and targets
Blocks 1–4
What you’ll learn: How to plan compliant targets, trigger scrapes with webhooks, extract data from static pages, and route dynamic sites to the right method
A solid plan and ingest path prevent rework. Then choose the simplest scraper first and escalate only when needed
1) Use cases and compliance
Scrape only what you need and only how you’re allowed to get it
- Map scope: goals → targets → fields → frequency
- Check ToS (terms of service), robots.txt (crawl rules), and local data laws before any call; a robots.txt pre-check sketch closes this block
- Prefer public, non-personal data; avoid paywalled or logged-in areas
“Respect comes first. Throughput comes second.” Choose the long game
A short pre-mortem avoids bans and rework
Legal note: document data sources, consent assumptions, and retention windows to align with internal policy and local laws
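Robots pre-check sketch: a minimal way to honor crawl rules before any request. It assumes a Node 18+ runtime with global fetch; isPathAllowed and its naive prefix matching are illustrative, not a full robots.txt parser
// Simplified robots.txt check: fetch the file and test the path against Disallow rules
// in the matching user-agent group. Illustrative only; a full parser also handles
// wildcards, Allow rules, and sitemap directives.
async function isPathAllowed(pageUrl, userAgent = '*') {
  const target = new URL(pageUrl);
  const res = await fetch(`${target.origin}/robots.txt`);
  if (!res.ok) return true; // no robots.txt is not a free pass: ToS and data laws still apply
  const text = await res.text();

  let applies = false;
  const disallowed = [];
  for (const raw of text.split('\n')) {
    const line = raw.split('#')[0].trim();
    if (/^user-agent:/i.test(line)) applies = line.slice(11).trim() === userAgent;
    if (applies && /^disallow:/i.test(line)) disallowed.push(line.slice(9).trim());
  }
  return !disallowed.some(prefix => prefix && target.pathname.startsWith(prefix));
}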
2) Webhook ingestion (event-first)
A webhook is an HTTP endpoint that receives events from other systems in real time
- Webhook node: trigger scrapes on demand
- Response mode: use Immediately for long jobs, or pair with Respond to Webhook for custom results
- Validate input: use IF and Function/Code nodes to reject malformed payloads early (a sketch follows the example payload)
- Fan-out: Split In Batches + Wait to pace downstream requests
Example payload (POST)
{
  "source": "catalog-refresh",
  "urls": [
    "https://example.com/p/sku-123",
    "https://example.com/p/sku-456"
  ],
  "priority": "high",
  "strategy": "auto"
}
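Validation sketch (Code node, "Run Once for All Items" mode): a minimal gate for the example payload above. Field names come from that sample; the defaults and error messages are illustrative
// The Webhook node exposes the POST body under json.body
const body = items[0].json.body || items[0].json;
const errors = [];

if (!body.source || typeof body.source !== 'string') errors.push('missing source');
if (!Array.isArray(body.urls) || body.urls.length === 0) errors.push('urls must be a non-empty array');
else if (body.urls.some(u => !/^https?:\/\//.test(String(u)))) errors.push('urls must be absolute http(s) URLs');

if (errors.length) throw new Error(`Invalid payload: ${errors.join('; ')}`);

// Emit one item per URL so Split In Batches and Wait can pace the fan-out downstream
return body.urls.map(url => ({
  json: { url, source: body.source, priority: body.priority || 'normal', strategy: body.strategy || 'auto' },
}));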
Treat webhooks as your API for scraping on demand
flowchart TD
A[Webhook Trigger] --> B[Validate Input]
B --> C{Valid?}
C -->|Yes| D[Fan Out]
C -->|No| E[Reject]
D --> F[Wait]
F --> G[Scrape Path]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
classDef alert fill:#f3e5f5,stroke:#7b1fa2
class A trigger
class B,C,D,F process
class E alert
class G action
3) Static pages: HTTP + HTML
Default to the simplest path for static pages
- HTTP Request: set realistic headers like User-Agent and Accept-Language
- HTML node: extract with CSS selectors and emit normalized JSON
- Pagination: use a loop with Split In Batches and build next-page URLs (a sketch closes this block)
Selector sketch
/* HTML node selectors */
.title: text
.price: text
.breadcrumb a:last-child: text
img.product: attr(src)
This combo is fast, cheap, and easy to debug
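Next-page sketch (Code node): one way to build pagination URLs inside a Split In Batches loop. The ?page= query scheme and the nextLinkFound flag are assumptions; adapt both to the target site
// Assumes the HTML node already set json.nextLinkFound from a "rel=next" (or similar) selector
return items.map(item => {
  const current = new URL(item.json.url);
  const page = Number(current.searchParams.get('page') || '1');
  current.searchParams.set('page', String(page + 1));
  return {
    json: {
      ...item.json,
      nextUrl: current.toString(),
      hasNext: Boolean(item.json.nextLinkFound), // route hasNext=false to an IF node that ends the loop
    },
  };
});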
4) Dynamic sites and APIs
Switch tactics when JavaScript renders data or anti-bot rules get noisy. A headless browser is an automated browser without a visible UI, useful for rendering JS-heavy pages
| Approach | Best for | Trade-offs |
|---|---|---|
| HTTP + HTML | Static pages | Fast and low cost, breaks on JS-rendered content |
| ScrapeNinja | Mixed or JS | Built-in rotation, simple, adds API cost and vendor reliance |
| Puppeteer | Complex JS flows | Full control and stealth, slower and heavier |
Keep both paths and route by domain or error signals
flowchart TD
A[Pick Target] --> B{JS or Errors?}
B -->|No| C[HTTP Path]
B -->|Yes| D{Needs Flow?}
D -->|No| E[API Service]
D -->|Yes| F[Headless Path]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A trigger
class B,D process
class C,E,F action
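Router sketch (Code node): one way to pick a strategy per URL before the paths split. The domain lists, the lastStatus field, and the strategy names are assumptions to adapt
// Domains and thresholds are placeholders; maintain them per target
const JS_HEAVY = ['shop.example.com'];     // content rendered client-side
const NEEDS_FLOW = ['portal.example.com']; // multi-step interactions required

return items.map(item => {
  const host = new URL(item.json.url).hostname;
  const blocked = item.json.lastStatus === 403 || item.json.lastStatus === 429;

  let strategy = 'http';                                      // HTTP Request + HTML node
  if (JS_HEAVY.includes(host) || blocked) strategy = 'api';   // e.g. ScrapeNinja
  if (NEEDS_FLOW.includes(host)) strategy = 'headless';       // e.g. Puppeteer

  return { json: { ...item.json, strategy } };
});
A Switch or IF node keyed on strategy then sends each item down the matching path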
Transition: with ingestion and strategy paths ready, focus on staying polite at scale and storing clean data
Blocks 5–7
What you’ll learn: How to blend in with proxy rotation, design a Postgres/Supabase schema, and dedupe changes with hashes
5) Rotation, stealth, pacing
Blend in like real traffic and avoid hammering origins
- Rotate IPs and user-agents; set per-request headers
- Add jittered delays with Wait and cap concurrency per domain (a jitter sketch follows the header sample)
- Back off on 429 or 503 responses; escalate method HTTP → JS → API
Header sample
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36
Accept-Language: en-US,en;q=0.9
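Jitter sketch (Code node): compute a randomized per-item delay for a downstream Wait node. The bounds and the delaySeconds field name are assumptions; tune them per domain
// Random delay between 2s and 5s; the Wait node can read delaySeconds through an
// expression in its Wait Amount field
const MIN_S = 2;
const MAX_S = 5;

return items.map(item => ({
  json: {
    ...item.json,
    delaySeconds: Math.round(MIN_S + Math.random() * (MAX_S - MIN_S)),
  },
}));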
6) Data model and storage
Treat storage as a product: queryable, auditable, evolvable. Upsert means insert a row or update it if it already exists. Deduplication removes duplicate records
- Tables: items, runs, item_changes with keys on url or external_id
- Postgres node: use INSERT and UPSERT and store run metadata
- Indexes: add on unique keys and common fetch patterns
Minimal schema
create table items (
  id bigserial primary key,
  url text unique,
  title text,
  price numeric,
  hash text,
  updated_at timestamptz default now()
);
create table runs (
  run_id uuid primary key,
  source text,
  started_at timestamptz default now(),
  status text
);
-- change log behind the ER diagram and the dedupe step
create table item_changes (
  id bigserial primary key,
  item_id bigint references items(id),
  field text,
  old_value text,
  new_value text,
  changed_at timestamptz default now()
);
Upsert pattern
insert into items (url, title, price, hash)
values ($1, $2, $3, $4)
on conflict (url) do update
  set title = excluded.title,
      price = excluded.price,
      hash = excluded.hash,
      updated_at = now();
erDiagram
Run ||--o{ Item : writes
Item ||--o{ ItemChange : has
Run {
uuid run_id
string source
datetime started_at
string status
}
Item {
int id
string url
string title
decimal price
string hash
datetime updated_at
}
ItemChange {
int id
int item_id
string field
string old_value
string new_value
datetime changed_at
}
7) Dedupe and change detection
Keep only new or changed facts. A hash is a short fingerprint of content used to detect changes
- Remove Duplicates: scope by workflow and key on url or hash
- Change detection: compute a content hash and compare before writes
- Custom rules: Code node for fuzzy or multi-field logic
Hashing example (Code node)
// Run Once for All Items mode: items holds every incoming item
const crypto = require('crypto');
return items.map(i => {
  const s = `${i.json.title}|${i.json.price}|${i.json.url}`;
  i.json.hash = crypto.createHash('sha1').update(s).digest('hex');
  return i;
});
Write only when the hash changes
flowchart TD
A[Compute Hash] --> B[Lookup Item]
B --> C{Hash Match?}
C -->|Yes| D[Skip Write]
C -->|No| E[Upsert Item]
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A,B,C process
class D,E action
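Compare sketch (Code node): drop unchanged items before the upsert. It assumes a prior Postgres lookup merged the stored hash onto each item as storedHash; that field name is an assumption
// Items whose fresh content hash matches the stored hash never reach the write step
return items
  .filter(item => item.json.hash !== item.json.storedHash)
  .map(item => ({ json: item.json }));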
Transition: with clean data flowing, package repeatable workflows and make failures easy to recover from
Blocks 8–10
What you’ll learn: How to ship reusable templates, add observability, and package the kit for your team
8) Workflow templates
Ship templates your team can drop in and run
Template: One URL → respond
- Webhook → HTTP → HTML → Respond to Webhook
- Validate inputs and map selectors
- Return normalized JSON to the caller
Template: Sitemap crawl
- Webhook receives sitemap URL
- HTTP fetch of the sitemap and parse URLs (a parsing sketch follows this list)
- Split In Batches → HTTP/HTML → DB write
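Sitemap parse sketch (Code node): pull the <loc> URLs out of the fetched sitemap. The regex approach covers plain urlset files, not sitemap indexes or gzipped sitemaps, and the data field name depends on your HTTP Request node settings
// Assumes the HTTP Request node returned the sitemap XML as text in json.data
const xml = items[0].json.data || '';
const urls = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1].trim());
return urls.map(url => ({ json: { url } }));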
Template: Multi-strategy router
- Webhook receives urls and source
- IF node checks domain rules and error signals
- Route to HTTP path, ScrapeNinja, or Puppeteer and then DB
flowchart TD
A[Webhook] --> B[Domain Rules]
B --> C[HTTP Path]
B --> D[API Service]
B --> E[Headless Path]
C --> F[DB Write]
D --> F
E --> F
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A trigger
class B process
class C,D,E,F action
9) Observability and resilience
Assume failure and celebrate recovery. Exponential backoff increases wait time between retries after errors
- Retries with backoff and clear final states
- Logs to a DB table and metrics to your notifier of choice
- Playbooks for 403, 404, 429, and 5xx responses, plus auto-pause on repeated 429s (a pause sketch follows this list)
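Pause sketch (Code node): a naive per-domain circuit breaker using workflow static data. The three-strike threshold and the statusCode field are assumptions, and static data only persists on active (production) executions
// Count consecutive 429s per host; route paused=true items to a Wait/park branch
const state = $getWorkflowStaticData('global');
state.strikes = state.strikes || {};

return items.map(item => {
  const host = new URL(item.json.url).hostname;
  state.strikes[host] = item.json.statusCode === 429 ? (state.strikes[host] || 0) + 1 : 0;
  return { json: { ...item.json, paused: state.strikes[host] >= 3 } };
});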
Retry sketch
// in a Code node before a request
// wait doubles each retry: 2s, then 4s with the defaults, before the final throw
async function withRetry(fn, tries = 3, base = 2000) {
  for (let i = 0; i < tries; i++) {
    try {
      return await fn();
    } catch (e) {
      if (i === tries - 1) throw e; // out of attempts: surface the last error
      const wait = base * Math.pow(2, i);
      await new Promise(r => setTimeout(r, wait));
    }
  }
}
flowchart TD
A[Do Request] --> B{Success?}
B -->|Yes| C[Record OK]
B -->|No| D[Retry Backoff]
D --> E{More Tries?}
E -->|Yes| A
E -->|No| F[Record Error]
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
classDef alert fill:#f3e5f5,stroke:#7b1fa2
class A,B,D,E process
class C action
class F alert
10) Package and share
Make reuse the default
- Conventions: node naming, env vars, secrets, error paths
- Docs: a 5-minute README with inputs, outputs, quotas, and costs
- Distribution: export workflows, publish a template bundle, version it
Tip: add example payloads and selector maps to each template so new users can run them in under five minutes
Transition: you now have a repeatable kit you can ship and extend across teams
From one-offs to platform
What you’ll learn: How the ten blocks fit together into a reusable scraping platform for n8n
You now have ten modular blocks that click together cleanly
- Event-driven ingestion, right-sized scraping, respectful pacing
- Robust storage, real dedupe, and sturdy error paths
- Templates your team can ship and extend
Start with the HTTP + HTML path, add headless only where needed, then package the result for others
Next steps: wire a domain-specific template, add domain rules to your router, publish the kit, and layer AI over your Postgres data for search and QA