Ship a scraper that won’t crumble under rate limits, bans, or legal missteps. This starter kit shows the fastest safe path from idea to production.
Why this kit
What you’ll learn: Why reusable n8n workflows beat one-off scripts and how event-first design, community nodes, and Postgres/Supabase improve stability
Modern scraping fails when treated as a one-off script. n8n lets you compose durable pieces instead
- Event-first ingestion beats brittle cron polling
- Community nodes add battle-tested power without glue code
- Postgres/Supabase enables upserts, dedupe, and audit trails
Build once, reuse across teams and targets
Blocks 1–4
What you’ll learn: How to plan compliant targets, trigger scrapes with webhooks, extract data from static pages, and route dynamic sites to the right method
A solid plan and ingest path prevent rework. Then choose the simplest scraper first and escalate only when needed
1) Use cases and compliance
Scrape only what you need and only how you’re allowed to get it
- Map scope: goals → targets → fields → frequency
- Check ToS (terms of service), robots.txt (crawl rules), and local data laws before any call; a robots.txt pre-check sketch closes this block
- Prefer public, non-personal data; avoid paywalled or logged-in areas
“Respect comes first. Throughput comes second.” Choose the long game
A short pre-mortem avoids bans and rework
Legal note: document data sources, consent assumptions, and retention windows to align with internal policy and local laws
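Robots pre-check sketch: a minimal way to honor crawl rules before any request. It assumes a Node 18+ runtime with global fetch; isPathAllowed and its naive prefix matching are illustrative, not a full robots.txt parser
// Simplified robots.txt check: fetch the file and test the path against Disallow rules
// in the matching user-agent group. Illustrative only; a full parser also handles
// wildcards, Allow rules, and sitemap directives.
async function isPathAllowed(pageUrl, userAgent = '*') {
  const target = new URL(pageUrl);
  const res = await fetch(`${target.origin}/robots.txt`);
  if (!res.ok) return true; // no robots.txt is not a free pass: ToS and data laws still apply
  const text = await res.text();

  let applies = false;
  const disallowed = [];
  for (const raw of text.split('\n')) {
    const line = raw.split('#')[0].trim();
    if (/^user-agent:/i.test(line)) applies = line.slice(11).trim() === userAgent;
    if (applies && /^disallow:/i.test(line)) disallowed.push(line.slice(9).trim());
  }
  return !disallowed.some(prefix => prefix && target.pathname.startsWith(prefix));
}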
2) Webhook ingestion (event-first)
A webhook is an HTTP endpoint that receives events from other systems in real time
- Webhook node: trigger scrapes on demand
- Response mode: use Immediately for long jobs, or pair with Respond to Webhook for custom results
- Validate input: use IF and Function/Code nodes to reject malformed payloads early (a sketch follows the example payload)
- Fan-out: Split In Batches + Wait to pace downstream requests
Example payload (POST)
{
  "source": "catalog-refresh",
  "urls": [
    "https://example.com/p/sku-123",
    "https://example.com/p/sku-456"
  ],
  "priority": "high",
  "strategy": "auto"
}
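Validation sketch (Code node, "Run Once for All Items" mode): a minimal gate for the example payload above. Field names come from that sample; the defaults and error messages are illustrative
// The Webhook node exposes the POST body under json.body
const body = items[0].json.body || items[0].json;
const errors = [];

if (!body.source || typeof body.source !== 'string') errors.push('missing source');
if (!Array.isArray(body.urls) || body.urls.length === 0) errors.push('urls must be a non-empty array');
else if (body.urls.some(u => !/^https?:\/\//.test(String(u)))) errors.push('urls must be absolute http(s) URLs');

if (errors.length) throw new Error(`Invalid payload: ${errors.join('; ')}`);

// Emit one item per URL so Split In Batches and Wait can pace the fan-out downstream
return body.urls.map(url => ({
  json: { url, source: body.source, priority: body.priority || 'normal', strategy: body.strategy || 'auto' },
}));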
Treat webhooks as your API for scraping on demand
flowchart TD
A[Webhook Trigger] --> B[Validate Input]
B --> C{Valid?}
C -->|Yes| D[Fan Out]
C -->|No| E[Reject]
D --> F[Wait]
F --> G[Scrape Path]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
classDef alert fill:#f3e5f5,stroke:#7b1fa2
class A trigger
class B,C,D,F process
class E alert
class G action
3) Static pages: HTTP + HTML
Default to the simplest path for static pages
- HTTP Request: set realistic headers like User-Agent and Accept-Language
- HTML node: extract with CSS selectors and emit normalized JSON
- Pagination: use a loop with Split In Batches and build next-page URLs (a sketch closes this block)
Selector sketch
/* HTML node selectors */
.title: text
.price: text
.breadcrumb a:last-child: text
img.product: attr(src)
This combo is fast, cheap, and easy to debug
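Next-page sketch (Code node): one way to build pagination URLs inside a Split In Batches loop. The ?page= query scheme and the nextLinkFound flag are assumptions; adapt both to the target site
// Assumes the HTML node already set json.nextLinkFound from a "rel=next" (or similar) selector
return items.map(item => {
  const current = new URL(item.json.url);
  const page = Number(current.searchParams.get('page') || '1');
  current.searchParams.set('page', String(page + 1));
  return {
    json: {
      ...item.json,
      nextUrl: current.toString(),
      hasNext: Boolean(item.json.nextLinkFound), // route hasNext=false to an IF node that ends the loop
    },
  };
});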
4) Dynamic sites and APIs
Switch tactics when JavaScript renders data or anti-bot rules get noisy. A headless browser is an automated browser without a visible UI, useful for rendering JS-heavy pages
| Approach | Best for | Trade-offs |
|---|---|---|
| HTTP + HTML | Static pages | Fast and low cost, breaks on JS-rendered content |
| ScrapeNinja | Mixed or JS | Built-in rotation, simple, adds API cost and vendor reliance |
| Puppeteer | Complex JS flows | Full control and stealth, slower and heavier |
Keep both paths and route by domain or error signals
flowchart TD
A[Pick Target] --> B{JS or Errors?}
B -->|No| C[HTTP Path]
B -->|Yes| D{Needs Flow?}
D -->|No| E[API Service]
D -->|Yes| F[Headless Path]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A trigger
class B,D process
class C,E,F action
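Router sketch (Code node): one way to pick a strategy per URL before the paths split. The domain lists, the lastStatus field, and the strategy names are assumptions to adapt
// Domains and thresholds are placeholders; maintain them per target
const JS_HEAVY = ['shop.example.com'];     // content rendered client-side
const NEEDS_FLOW = ['portal.example.com']; // multi-step interactions required

return items.map(item => {
  const host = new URL(item.json.url).hostname;
  const blocked = item.json.lastStatus === 403 || item.json.lastStatus === 429;

  let strategy = 'http';                                      // HTTP Request + HTML node
  if (JS_HEAVY.includes(host) || blocked) strategy = 'api';   // e.g. ScrapeNinja
  if (NEEDS_FLOW.includes(host)) strategy = 'headless';       // e.g. Puppeteer

  return { json: { ...item.json, strategy } };
});
A Switch or IF node keyed on strategy then sends each item down the matching path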
Transition: with ingestion and strategy paths ready, focus on staying polite at scale and storing clean data
Blocks 5–7
What you’ll learn: How to blend in with proxy rotation, design a Postgres/Supabase schema, and dedupe changes with hashes
5) Rotation, stealth, pacing
Blend in like real traffic and avoid hammering origins
- Rotate IPs and user-agents; set per-request headers
- Add jittered delays with Wait and cap concurrency per domain (a jitter sketch follows the header sample)
- Back off on 429 or 503 responses; escalate method HTTP → JS → API
Header sample
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36
Accept-Language: en-US,en;q=0.9
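Jitter sketch (Code node): compute a randomized per-item delay for a downstream Wait node. The bounds and the delaySeconds field name are assumptions; tune them per domain
// Random delay between 2s and 5s; the Wait node can read delaySeconds through an
// expression in its Wait Amount field
const MIN_S = 2;
const MAX_S = 5;

return items.map(item => ({
  json: {
    ...item.json,
    delaySeconds: Math.round(MIN_S + Math.random() * (MAX_S - MIN_S)),
  },
}));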
6) Data model and storage
Treat storage as a product: queryable, auditable, evolvable. Upsert means insert a row or update it if it already exists. Deduplication removes duplicate records
- Tables: items, runs, item_changes with keys on url or external_id
- Postgres node: use INSERT and UPSERT and store run metadata
- Indexes: add on unique keys and common fetch patterns
Minimal schema
create table items (
  id bigserial primary key,
  url text unique,
  title text,
  price numeric,
  hash text,
  updated_at timestamptz default now()
);
create table runs (
  run_id uuid primary key,
  source text,
  started_at timestamptz default now(),
  status text
);
-- change log behind the ER diagram and the dedupe step
create table item_changes (
  id bigserial primary key,
  item_id bigint references items(id),
  field text,
  old_value text,
  new_value text,
  changed_at timestamptz default now()
);
Upsert pattern
insert into items (url, title, price, hash)
values ($1, $2, $3, $4)
on conflict (url) do update
  set title = excluded.title,
      price = excluded.price,
      hash = excluded.hash,
      updated_at = now();
erDiagram
Run ||--o{ Item : writes
Item ||--o{ ItemChange : has
Run {
uuid run_id
string source
datetime started_at
string status
}
Item {
int id
string url
string title
decimal price
string hash
datetime updated_at
}
ItemChange {
int id
int item_id
string field
string old_value
string new_value
datetime changed_at
}
7) Dedupe and change detection
Keep only new or changed facts. A hash is a short fingerprint of content used to detect changes
- Remove Duplicates: scope by workflow and key on url or hash
- Change detection: compute a content hash and compare before writes
- Custom rules: Code node for fuzzy or multi-field logic
Hashing example (Code node)
// Run Once for All Items mode: items holds every incoming item
const crypto = require('crypto');
return items.map(i => {
  const s = `${i.json.title}|${i.json.price}|${i.json.url}`;
  i.json.hash = crypto.createHash('sha1').update(s).digest('hex');
  return i;
});
Write only when the hash changes
flowchart TD
A[Compute Hash] --> B[Lookup Item]
B --> C{Hash Match?}
C -->|Yes| D[Skip Write]
C -->|No| E[Upsert Item]
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A,B,C process
class D,E action
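Compare sketch (Code node): drop unchanged items before the upsert. It assumes a prior Postgres lookup merged the stored hash onto each item as storedHash; that field name is an assumption
// Items whose fresh content hash matches the stored hash never reach the write step
return items
  .filter(item => item.json.hash !== item.json.storedHash)
  .map(item => ({ json: item.json }));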
Transition: with clean data flowing, package repeatable workflows and make failures easy to recover from
Blocks 8–10
What you’ll learn: How to ship reusable templates, add observability, and package the kit for your team
8) Workflow templates
Ship templates your team can drop in and run
Template: One URL → respond
- Webhook → HTTP → HTML → Respond to Webhook
- Validate inputs and map selectors
- Return normalized JSON to the caller
Template: Sitemap crawl
- Webhook receives sitemap URL
- HTTP fetch of the sitemap and parse URLs (a parsing sketch follows this list)
- Split In Batches → HTTP/HTML → DB write
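Sitemap parse sketch (Code node): pull the <loc> URLs out of the fetched sitemap. The regex approach covers plain urlset files, not sitemap indexes or gzipped sitemaps, and the data field name depends on your HTTP Request node settings
// Assumes the HTTP Request node returned the sitemap XML as text in json.data
const xml = items[0].json.data || '';
const urls = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map(m => m[1].trim());
return urls.map(url => ({ json: { url } }));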
Template: Multi-strategy router
- Webhook receives urls and source
- IF node checks domain rules and error signals
- Route to HTTP path, ScrapeNinja, or Puppeteer and then DB
flowchart TD
A[Webhook] --> B[Domain Rules]
B --> C[HTTP Path]
B --> D[API Service]
B --> E[Headless Path]
C --> F[DB Write]
D --> F
E --> F
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A trigger
class B process
class C,D,E,F action
9) Observability and resilience
Assume failure and celebrate recovery. Exponential backoff increases wait time between retries after errors
- Retries with backoff and clear final states
- Logs to a DB table and metrics to your notifier of choice
- Playbooks for 403, 404, 429, and 5xx responses, plus auto-pause on repeated 429s (a pause sketch follows this list)
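Pause sketch (Code node): a naive per-domain circuit breaker using workflow static data. The three-strike threshold and the statusCode field are assumptions, and static data only persists on active (production) executions
// Count consecutive 429s per host; route paused=true items to a Wait/park branch
const state = $getWorkflowStaticData('global');
state.strikes = state.strikes || {};

return items.map(item => {
  const host = new URL(item.json.url).hostname;
  state.strikes[host] = item.json.statusCode === 429 ? (state.strikes[host] || 0) + 1 : 0;
  return { json: { ...item.json, paused: state.strikes[host] >= 3 } };
});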
Retry sketch
// in a Code node before a request
// wait doubles each retry: 2s, then 4s with the defaults, before the final throw
async function withRetry(fn, tries = 3, base = 2000) {
  for (let i = 0; i < tries; i++) {
    try {
      return await fn();
    } catch (e) {
      if (i === tries - 1) throw e; // out of attempts: surface the last error
      const wait = base * Math.pow(2, i);
      await new Promise(r => setTimeout(r, wait));
    }
  }
}
flowchart TD
A[Do Request] --> B{Success?}
B -->|Yes| C[Record OK]
B -->|No| D[Retry Backoff]
D --> E{More Tries?}
E -->|Yes| A
E -->|No| F[Record Error]
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
classDef alert fill:#f3e5f5,stroke:#7b1fa2
class A,B,D,E process
class C action
class F alert
10) Package and share
Make reuse the default
- Conventions: node naming, env vars, secrets, error paths
- Docs: a 5-minute README with inputs, outputs, quotas, and costs
- Distribution: export workflows, publish a template bundle, version it
Tip: add example payloads and selector maps to each template so new users can run them in under five minutes
Transition: you now have a repeatable kit you can ship and extend across teams
From one-offs to platform
What you’ll learn: How the ten blocks fit together into a reusable scraping platform for n8n
You now have ten modular blocks that click together cleanly
- Event-driven ingestion, right-sized scraping, respectful pacing
- Robust storage, real dedupe, and sturdy error paths
- Templates your team can ship and extend
Start with the HTTP + HTML path, add headless only where needed, then package the result for others
Next steps: wire a domain-specific template, add domain rules to your router, publish the kit, and layer AI over your Postgres data for search and QA