
n8n Web Scraping + Webhooks Starter Kit: 10 Building Blocks


💡

Ship a scraper that won’t crumble under rate limits, bans, or legal missteps. This starter kit shows the fastest safe path from idea to production.

Why this kit

What you’ll learn: Why reusable n8n workflows beat one-off scripts and how event-first design, community nodes, and Postgres/Supabase improve stability

Modern scraping fails when treated as a one-off script. n8n lets you compose durable pieces instead

  • Event-first ingestion beats brittle cron polling
  • Community nodes add battle-tested power without glue code
  • Postgres/Supabase enables upserts, dedupe, and audit trails

Build once, reuse across teams and targets


Blocks 1–4

What you’ll learn: How to plan compliant targets, trigger scrapes with webhooks, extract data from static pages, and route dynamic sites to the right method

A solid plan and ingest path prevent rework. Start with the simplest scraper and escalate only when needed

1) Use cases and compliance

Scrape only what you need and only how you’re allowed to get it

  • Map scope: goals → targets → fields → frequency
  • Check ToS (terms of service), robots.txt (crawl rules), and local data laws before any call; a naive robots.txt check is sketched after this list
  • Prefer public, non-personal data; avoid paywalled or logged-in areas
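
Robots check sketch (Code node): a naive Disallow match, assuming a prior HTTP Request node fetched robots.txt onto json.robotsTxt. It ignores User-agent groups and wildcards, so treat it as a pre-flight hint, not a compliance guarantee

// n8n Code node: naive robots.txt Disallow check (illustrative only)
// Assumes a prior HTTP Request node stored the robots.txt body on json.robotsTxt
const items = $input.all();
return items.map(i => {
  const path = new URL(i.json.url).pathname;
  const disallows = (i.json.robotsTxt || '')
    .split('\n')
    .filter(line => line.toLowerCase().startsWith('disallow:'))
    .map(line => line.slice('disallow:'.length).trim())
    .filter(Boolean);
  i.json.allowed = !disallows.some(rule => path.startsWith(rule));
  return i;
});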

“Respect comes first. Throughput comes second.” Choose the long game

A short pre-mortem avoids bans and rework

💡

Legal note: document data sources, consent assumptions, and retention windows to align with internal policy and local laws

2) Webhook ingestion (event-first)

A webhook is an HTTP endpoint that receives events from other systems in real time

  • Webhook node: trigger scrapes on demand
  • Response mode: use Immediately for long jobs, or pair with Respond to Webhook for custom results
  • Validate input: use IF and Function/Code nodes to reject malformed payloads early (a validation sketch follows the example payload)
  • Fan-out: Split In Batches + Wait to pace downstream requests

Example payload (POST)

{
  "source": "catalog-refresh",
  "urls": [
    "https://example.com/p/sku-123",
    "https://example.com/p/sku-456"
  ],
  "priority": "high",
  "strategy": "auto"
}
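
Validation sketch (Code node): a minimal shape check matching the example payload above; the valid flag and the body fallback are assumptions, so adapt them to your own error path

// n8n Code node: flag malformed payloads before fan-out
// Field names mirror the example payload; the valid flag is an assumption
const items = $input.all();
return items.map(i => {
  const body = i.json.body || i.json; // webhook data usually arrives under json.body
  const urls = Array.isArray(body.urls)
    ? body.urls.filter(u => { try { new URL(u); return true; } catch { return false; } })
    : [];
  i.json.valid = urls.length > 0 && typeof body.source === 'string';
  i.json.urls = urls;
  return i;
});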

Treat webhooks as your API for scraping on demand

flowchart TD
    A[Webhook Trigger] --> B[Validate Input]
    B --> C{Valid?}
    C -->|Yes| D[Fan Out]
    C -->|No| E[Reject]
    D --> F[Wait]
    F --> G[Scrape Path]

    classDef trigger fill:#e1f5fe,stroke:#01579b
    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32
    classDef alert fill:#f3e5f5,stroke:#7b1fa2

    class A trigger
    class B,C,D,F process
    class E alert
    class G action

3) Static pages: HTTP + HTML

Default to the simplest path for static pages

  • HTTP Request: set realistic headers like User-Agent and Accept-Language
  • HTML node: extract with CSS selectors and emit normalized JSON
  • Pagination: use a loop with Split In Batches and build next-page URLs (see the pagination sketch at the end of this block)

Selector sketch

/* HTML node selectors */
.title: text
.price: text
.breadcrumb a:last-child: text
img.product: attr(src)

This combo is fast, cheap, and easy to debug
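
Pagination sketch (Code node): one way to build the next-page URL from the bullet above, assuming the site paginates with a page query parameter; adjust for sites that use rel=next links or offsets instead

// n8n Code node: build the next page URL (assumes ?page= pagination)
const items = $input.all();
return items.map(i => {
  const url = new URL(i.json.url);
  const page = Number(url.searchParams.get('page') || '1');
  url.searchParams.set('page', String(page + 1));
  i.json.nextUrl = url.toString();
  return i;
});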

4) Dynamic sites and APIs

Switch tactics when JavaScript renders data or anti-bot rules get noisy. A headless browser is an automated browser without a visible UI, useful for rendering JS-heavy pages

| Approach | Best for | Trade-offs |
| --- | --- | --- |
| HTTP + HTML | Static pages | Fast and low cost, breaks on JS-rendered content |
| ScrapeNinja | Mixed or JS | Built-in rotation, simple, adds API cost and vendor reliance |
| Puppeteer | Complex JS flows | Full control and stealth, slower and heavier |

Keep both paths and route by domain or error signals; a routing sketch follows the diagram

flowchart TD
    A[Pick Target] --> B{JS or Errors?}
    B -->|No| C[HTTP Path]
    B -->|Yes| D{Needs Flow?}
    D -->|No| E[API Service]
    D -->|Yes| F[Headless Path]

    classDef trigger fill:#e1f5fe,stroke:#01579b
    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32

    class A trigger
    class B,D process
    class C,E,F action
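
Routing sketch (Code node): illustrative domain and error rules only; the JS_HEAVY list and the lastStatus field are assumptions you would replace with your own signals

// n8n Code node: pick a scrape strategy per URL (illustrative rules)
const JS_HEAVY = ['shop.example.com', 'app.example.com']; // assumption: known JS-heavy hosts
const items = $input.all();
return items.map(i => {
  const host = new URL(i.json.url).hostname;
  const blocked = i.json.lastStatus === 403 || i.json.lastStatus === 429; // assumption: prior status merged in
  i.json.strategy = JS_HEAVY.includes(host) || blocked ? 'headless' : 'http';
  return i;
});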

Transition: with ingestion and strategy paths ready, focus on staying polite at scale and storing clean data


Blocks 5–7

What you’ll learn: How to blend in with proxy rotation, design a Postgres/Supabase schema, and dedupe changes with hashes

5) Rotation, stealth, pacing

Blend in like real traffic and avoid hammering origins

  • Rotate IPs and user-agents; set per-request headers
  • Add jittered delays with Wait and cap concurrency per domain (a jitter sketch follows the header sample)
  • Back off on 429 or 503 responses; escalate method: HTTP → JS → API

Header sample

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36
Accept-Language: en-US,en;q=0.9
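
Jitter sketch (Code node): computes a randomized delay you can feed into a Wait node; the base and jitter values are assumptions to tune per domain

// n8n Code node: compute a jittered delay (ms) for a downstream Wait node
const BASE_MS = 2000;   // assumption: baseline delay per request
const JITTER_MS = 1500; // assumption: random spread on top of the baseline
const items = $input.all();
return items.map(i => {
  i.json.waitMs = BASE_MS + Math.floor(Math.random() * JITTER_MS);
  return i;
});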

6) Data model and storage

Treat storage as a product: queryable, auditable, evolvable. Upsert means insert a row or update it if it already exists. Deduplication removes duplicate records

  • Tables: items, runs, item_changes with keys on url or external_id
  • Postgres node: use INSERT with ON CONFLICT upserts, and store run metadata
  • Indexes: add on unique keys and common fetch patterns

Minimal schema

create table items (
  id bigserial primary key,
  url text unique,
  title text,
  price numeric,
  hash text,
  updated_at timestamptz default now()
);

create table runs (
  run_id uuid primary key,
  source text,
  started_at timestamptz default now(),
  status text
);

Upsert pattern

insert into items (url, title, price, hash)
values ($1, $2, $3, $4)
on conflict (url) do update
set title = excluded.title,
    price = excluded.price,
    hash  = excluded.hash,
    updated_at = now();

erDiagram
    Run ||--o{ Item : writes
    Item ||--o{ ItemChange : has

    Run {
        uuid run_id
        string source
        datetime started_at
        string status
    }

    Item {
        int id
        string url
        string title
        numeric price
        string hash
        datetime updated_at
    }

    ItemChange {
        int id
        int item_id
        string field
        string old_value
        string new_value
        datetime changed_at
    }

7) Dedupe and change detection

Keep only new or changed facts. A hash is a short fingerprint of content used to detect changes

  • Remove Duplicates: scope by workflow and key on url or hash
  • Change detection: compute a content hash and compare before writes
  • Custom rules: Code node for fuzzy or multi-field logic

Hashing example (Code node)

// n8n Code node (mode: Run Once for All Items)
const crypto = require('crypto');
const items = $input.all();
return items.map(i => {
  const s = `${i.json.title}|${i.json.price}|${i.json.url}`;
  i.json.hash = crypto.createHash('sha1').update(s).digest('hex');
  return i;
});

Write only when the hash changes

flowchart TD
    A[Compute Hash] --> B[Lookup Item]
    B --> C{Hash Match?}
    C -->|Yes| D[Skip Write]
    C -->|No| E[Upsert Item]

    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32

    class A,B,C process
    class D,E action
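
Compare sketch (Code node): pairs with the diagram above, assuming a prior Postgres lookup merged the stored hash onto each item as storedHash; only changed items continue to the upsert

// n8n Code node: keep only items whose content hash changed
// Assumes a prior Postgres node merged the stored value as json.storedHash
const items = $input.all();
return items.filter(i => i.json.hash !== i.json.storedHash);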

Transition: with clean data flowing, package repeatable workflows and make failures easy to recover from


Blocks 8–10

What you’ll learn: How to ship reusable templates, add observability, and package the kit for your team

8) Workflow templates

Ship templates your team can drop in and run

Template: One URL → respond

  1. Webhook → HTTP → HTML → Respond to Webhook
  2. Validate inputs and map selectors
  3. Return normalized JSON to the caller
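
Example response (illustrative values, shaped to match the selector sketch in block 3; adapt the fields to your own schema)

{
  "url": "https://example.com/p/sku-123",
  "title": "Example Product",
  "price": 19.99,
  "image": "https://example.com/img/sku-123.jpg"
}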

Template: Sitemap crawl

  1. Webhook receives sitemap URL
  2. HTTP fetch of sitemap and parse URLs
  3. Split In Batches → HTTP/HTML → DB write

Template: Multi-strategy router

  1. Webhook receives urls and source
  2. IF node checks domain rules and error signals
  3. Route to HTTP path, ScrapeNinja, or Puppeteer and then DB

flowchart TD
    A[Webhook] --> B[Domain Rules]
    B --> C[HTTP Path]
    B --> D[API Service]
    B --> E[Headless Path]
    C --> F[DB Write]
    D --> F
    E --> F

    classDef trigger fill:#e1f5fe,stroke:#01579b
    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32

    class A trigger
    class B process
    class C,D,E,F action

9) Observability and resilience

Assume failure and celebrate recovery. Exponential backoff increases wait time between retries after errors

  • Retries with backoff and clear final states
  • Logs to a DB table and metrics to your notifier of choice
  • Playbooks for 403, 404, 429, 5xx and auto-pause on repeated 429s

Retry sketch

// in a Code node before a request
async function withRetry(fn, tries = 3, base = 2000) {
  for (let i = 0; i < tries; i++) {
    try { return await fn(); } catch (e) {
      if (i === tries - 1) throw e;
      const wait = base * Math.pow(2, i);
      await new Promise(r => setTimeout(r, wait));
    }
  }
}
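
Auto-pause sketch (Code node): a minimal per-domain 429 counter to support the auto-pause bullet above; the threshold and the status field are assumptions

// n8n Code node: count 429s per domain and flag an auto-pause
const THRESHOLD = 3; // assumption: pause a domain after three 429s in one run
const items = $input.all();
const counts = {};
for (const i of items) {
  const host = new URL(i.json.url).hostname;
  if (i.json.status === 429) counts[host] = (counts[host] || 0) + 1; // assumption: status merged from the request step
}
return items.map(i => {
  const host = new URL(i.json.url).hostname;
  i.json.pauseDomain = (counts[host] || 0) >= THRESHOLD;
  return i;
});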

flowchart TD
    A[Do Request] --> B{Success?}
    B -->|Yes| C[Record OK]
    B -->|No| D[Retry Backoff]
    D --> E{More Tries?}
    E -->|Yes| A
    E -->|No| F[Record Error]

    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32
    classDef alert fill:#f3e5f5,stroke:#7b1fa2

    class A,B,D,E process
    class C action
    class F alert

10) Package and share

Make reuse the default

  • Conventions: node naming, env vars, secrets, error paths
  • Docs: a 5-minute README with inputs, outputs, quotas, and costs
  • Distribution: export workflows, publish a template bundle, version it

💡

Tip: add example payloads and selector maps to each template so new users can run them in under five minutes

Transition: you now have a repeatable kit you can ship and extend across teams


From one-offs to platform

What you’ll learn: How the ten blocks fit together into a reusable scraping platform for n8n

You now have ten modular blocks that click together cleanly

  • Event-driven ingestion, right-sized scraping, respectful pacing
  • Robust storage, real dedupe, and sturdy error paths
  • Templates your team can ship and extend

Start with the HTTP + HTML path, add headless only where needed, then package the result for others

💡

Next steps: wire a domain-specific template, add domain rules to your router, publish the kit, and layer AI over your Postgres data for search and QA
