1. Why automate now
What you’ll learn: Why modern sites require browser-aware scraping and how n8n keeps scrapers reliable
Modern sites rarely ship data in static HTML. They stream content with JavaScript, hide it behind logins, and enforce rate limits
Think of NASA’s checklists on Apollo 13. Stepwise discipline saved the mission. The same mindset keeps scrapers alive
- JavaScript-heavy pages need rendering, not just fetching
- Login flows require sessions, cookies, and redirects
- Staying unblocked needs pacing, realistic headers, and proxy hygiene
In short, n8n browser automation turns “just scrape it” into a robust workflow that blends control with restraint
Focus on outcomes: pick the lightest tool that gets the data reliably, then layer anti-blocking and compliance from day one
Transition: With the challenge framed, let’s map the options inside n8n from lightest to heaviest
flowchart TD
A[Start] --> B{HTML has data}
B -->|Yes| C[HTTP + Extract]
B -->|No| D{Needs JS render}
D -->|Low auth| E[Sidecar Browser]
D -->|Spiky load| F[Managed Browser]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A,B trigger
class C,E,F action
class D process
2. n8n options
What you’ll learn: When to use HTTP Request, HTML Extract, or a real browser in n8n
Two native tools go far before you touch a real browser. Use them first
2.1 HTTP Request: default
Short, fast, and cost‑efficient. Ideal when content is server‑rendered or an API exists
- Configure method, URL, and query params
- Set headers to look like a real browser
- Add pagination and retries
HTTP Request (n8n)
Method: GET
URL: https://example.com/search?q={{$json.query}}
Headers:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36
Accept-Language: en-US,en;q=0.9
Accept: text/html,application/xhtml+xml
Response: String -> JSON (if API)
- Pros: minimal resources, easy to scale, clean outputs
- Cons: no JS execution; struggles with anti‑bot defenses and complex auth
One node can cover a surprising amount of ground
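The same request can be sketched outside n8n, for example in a companion script. This is a minimal Python sketch using only the standard library; the URL handling and `fetch_page` helper are illustrative, not an n8n API, while the header values mirror the node config above:

```python
import urllib.request

def build_browser_headers() -> dict:
    """Headers that mirror the HTTP Request node configuration."""
    return {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120 Safari/537.36"),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml",
    }

def fetch_page(url: str) -> str:
    """GET a page with browser-like headers; raises on HTTP errors."""
    req = urllib.request.Request(url, headers=build_browser_headers())
    with urllib.request.urlopen(req, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```

The point is the header discipline: a bare client User-Agent is the fastest way to get served an empty shell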
2.2 HTML Extract: structure
Pair it with HTTP Request to turn raw HTML into fields. Note that HTML Extract only parses markup it is given; “rendering”, meaning executing JavaScript to build the DOM (document object model), requires a real browser
- Selectors: CSS selectors for multiple fields per page
- Transform: map attributes like href or src, trim text, normalize
- Output: JSON arrays ready for databases or spreadsheets
HTML Extract (n8n)
Source: {{$json.body}}
Selectors:
title: h1.product-title
price: .price > span
link: a.product-card::attr(href)
Use it whenever the HTML already contains the data you need
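To make the extraction concrete, here is a rough Python stand-in built on the standard library’s `html.parser`. It pulls the same three fields as the selectors above, but only approximates them (descendant matching instead of the strict `>` child combinator), and the class names are the same assumptions as in the node config:

```python
from html.parser import HTMLParser

class ProductExtractor(HTMLParser):
    """Rough stand-in for the HTML Extract node's selector config."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._capture = None    # name of the field whose text we want next
        self._in_price = False  # inside a .price container

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = attrs.get("class", "").split()
        if tag == "h1" and "product-title" in classes:
            self._capture = "title"              # h1.product-title
        elif "price" in classes:
            self._in_price = True                # entered .price
        elif tag == "span" and self._in_price:
            self._capture = "price"              # .price span (approx of .price > span)
        elif tag == "a" and "product-card" in classes:
            # a.product-card::attr(href)
            self.fields["link"] = attrs.get("href", "")

    def handle_data(self, data):
        if self._capture and data.strip():
            self.fields[self._capture] = data.strip()
            if self._capture == "price":
                self._in_price = False
            self._capture = None
```

In the real workflow the node does this for you; the sketch just shows why selector output lands as clean JSON fields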
2.3 When HTTP stops
Signals that you need a real browser
- JS-only content appears after SPA routes or virtual scrolling. SPA means single‑page app
- Login complexity uses CSRF tokens (anti‑forgery), SSO redirects (single sign‑on), or WebAuthn (key‑based login)
- Anti‑bot defenses show interstitials, challenges, or empty shells
When you see these, escalate to rendering in n8n with a browser
Transition: If HTTP cannot reach the data, choose the lightest browser pattern that works
flowchart TD
A[Content check] --> B{API present}
B -->|Yes| C[HTTP Request]
B -->|No| D{Server HTML}
D -->|Yes| E[HTTP + Extract]
D -->|No| F[Browser Needed]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A,B,D process
class C,E,F action
3. Browser patterns
What you’ll learn: How to run Puppeteer or Playwright with n8n and when to go managed
Puppeteer and Playwright unlock rendering, sessions, and complex flows. “Headless” means running a browser without a visible UI
3.1 When to switch
- JS-heavy sites: React, Vue, Angular, or shadow DOM (encapsulated components)
- Interactive flows: click‑through wizards, file downloads, or PDF generation
- Hard defenses: checks for navigator.webdriver, human‑like timings, or cookie integrity
Render when necessary, fetch when possible
3.2 Sidecar container
Run Chromium next to n8n for low‑latency control
# docker-compose.yml (excerpt)
services:
  n8n:
    image: n8nio/n8n:latest
    depends_on: [browser]
    environment:
      N8N_METRICS: "true"
  browser:
    image: mcr.microsoft.com/playwright:v1.45.0-jammy
    # keep a long-lived browser server running for n8n to connect to
    command: ["npx", "playwright", "run-server", "--port", "3000"]
- Pros: data stays in your VPC, predictable costs, fast
- Cons: you manage patches, headless flags, fonts, and memory
Great for steady workloads with strict data boundaries
3.3 Managed browsers
Offload browsers to an API with Browserless or Apify
HTTP Request -> Browserless
Method: POST
URL: https://chrome.browserless.io/content?token={{$env.BROWSERLESS_TOKEN}}
Body: {
"url": "https://portal.example.com/report",
"gotoOptions": {"waitUntil":"networkidle2"},
"stealth": true
}
HTTP Request -> Apify Actor
Method: POST
URL: https://api.apify.com/v2/acts/{actorId}/runs?token={{$env.APIFY_TOKEN}}
Body: { "startUrls": [{"url": "https://site.example"}], "maxRequestsPerCrawl": 50 }
- Pros: zero Chrome ops, elastic scale, anti‑detection features
- Cons: network hop cost, per‑run pricing, vendor lock‑in
Use when workloads spike or sites fight hard
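As a sketch, the Browserless call above can be assembled in Python before handing it to any HTTP client; the endpoint and body mirror the node config, and the token handling is illustrative:

```python
import json

def browserless_content_request(url: str, token: str) -> tuple:
    """Build the POST endpoint and JSON body for Browserless's /content
    endpoint, mirroring the HTTP Request node configuration above."""
    endpoint = f"https://chrome.browserless.io/content?token={token}"
    body = {
        "url": url,
        # wait until the page has (nearly) stopped making network requests
        "gotoOptions": {"waitUntil": "networkidle2"},
        "stealth": True,
    }
    return endpoint, json.dumps(body).encode("utf-8")
```

Keeping payload construction in one place makes it easy to swap the managed vendor later without touching the rest of the workflow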
3.4 Quick choices
| Option | Best for | Trade‑offs |
|---|---|---|
| HTTP + Extract | Static pages and open APIs | No JS, weaker vs bot defenses |
| Sidecar | Private data and low latency | You patch and monitor |
| Managed | Bursty loads and PDFs | External dependency and cost |
Rule of thumb: start with HTTP Request + HTML Extract. If a single page needs JS, isolate that step behind a browser call and keep the rest HTTP‑only
Transition: Next, let’s assemble an end‑to‑end example that mixes both paths
flowchart TD
A[Trigger] --> B{Login via HTTP}
B -->|Works| C[Get Report]
B -->|Fails| D[Use Browser]
C --> E[Download PDF]
D --> E[Download PDF]
E --> F[Drive Upload]
F --> G[Slack Notify]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A trigger
class B process
class C,D,E,F,G action
4. Login to PDF flow
What you’ll learn: A resilient n8n workflow to log in, fetch a report, save a PDF, and notify the team
4.1 Scenario
- Daily report sits behind a login
- Navigate to report and wait for generation
- Download PDF and archive to Drive
Design for retries and clear logs
4.2 Nodes overview
- Cron Trigger: 07:15 UTC daily
- Credentials: n8n encrypted credentials for user/pass or OAuth
- Branch: try HTTP login, then fall back to the browser if needed
- Navigate: open the report page and wait for a selector or network idle. networkidle2 fires when no more than two network connections have been active for 500 ms
- Download: fetch the PDF, then verify byte size and checksum
- Google Drive: save with a timestamp
- Slack: send run metadata
4.3 HTTP‑first login
HTTP Request (POST)
URL: https://portal.example.com/login
Body: { "username": "{{$cred.user}}", "password": "{{$cred.pass}}" }
Options: Follow Redirects = true, Send Cookies = true
Then -> HTTP Request (GET) dashboard with cookies
Then -> HTML Extract selectors for report link
- When it works: classic session cookies carry the auth
- Fallback: switch to the browser path when expected content is missing
Keep this path as your default for speed and cost
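A minimal Python sketch of the HTTP-first session, using only the standard library; the login URL and JSON field names are placeholders for whatever the portal actually expects:

```python
import json
import urllib.request
from http.cookiejar import CookieJar

def make_session() -> urllib.request.OpenerDirector:
    """Build an opener that keeps cookies across requests and follows
    redirects, like 'Send Cookies' + 'Follow Redirects' in the node."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent",
                          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/120 Safari/537.36")]
    return opener

def login(opener, base_url: str, username: str, password: str):
    """POST credentials; the session cookie lands in the opener's jar
    and rides along on every subsequent request through this opener."""
    payload = json.dumps({"username": username, "password": password}).encode()
    req = urllib.request.Request(f"{base_url}/login", data=payload,
                                 headers={"Content-Type": "application/json"})
    return opener.open(req, timeout=30)
```

The cookie jar is the whole trick: one opener, reused for login, dashboard, and download, behaves like one logged-in browser tab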
4.4 Browser fallback
Steps
- Go to login page
- Wait for user and pass fields
- Type credentials and submit
- Wait for navigation to complete
- Open reports page
- Wait for download button
- Click download and wait for PDF
- Return bytes buffer
HTTP Request -> Browserless (or Puppeteer node)
Action: goto('https://portal.example.com/login')
Steps:
- waitForSelector('#user')
- type('#user', {{$cred.user}})
- type('#pass', {{$cred.pass}})
- click('button[type=submit]')
- waitForNavigation({ waitUntil: 'networkidle2' })
- click('a[href*="/reports"]')
- waitForSelector('button.download-pdf')
- click('button.download-pdf')
- waitForResponse(/\.pdf$/)
- return Buffer(pdfBytes)
- Tip: enable stealth mode and a realistic viewport
- Optimization: store cookies to reuse sessions on the next run
4.5 Save to Google Drive
Google Drive (Upload)
Filename: report-{{$now}}.pdf
Mime Type: application/pdf
Content: {{$json.pdfBytes}}
Folder: /Reports/PortalA/
- Auth: service account or OAuth with least privilege
- Traceability: append the execution ID in metadata
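The "verify bytes and checksum" step can be sketched in Python; the `%PDF` magic-number check and SHA-256 digest are one reasonable choice, and the filename helper mirrors the `report-{{$now}}.pdf` pattern above:

```python
import hashlib
from datetime import datetime, timezone

def verify_pdf(data: bytes) -> str:
    """Sanity-check downloaded bytes (PDF magic number) and return a
    SHA-256 checksum to log alongside the upload."""
    if not data.startswith(b"%PDF"):
        raise ValueError("response is not a PDF")
    return hashlib.sha256(data).hexdigest()

def archive_name(prefix: str = "report") -> str:
    """Timestamped filename matching the Drive upload pattern."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
    return f"{prefix}-{stamp}.pdf"
```

Logging the checksum with the execution ID gives you lineage: any archived file can be traced back to the exact run that produced it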
4.6 Errors and retries
- Wrap outbound calls with Try and Catch using Error Trigger or IF nodes
- Handle 429 and 5xx with exponential backoff and jitter. Jitter adds small random delays
- Emit metrics for success rate, bytes fetched, and median latency
Backoff policy
base: 2000ms
factor: 2.0
jitter: +/- 20%
maxDelay: 60000ms
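The policy above translates directly into a small Python helper; the names and defaults are taken from the policy block, and the return value is a delay in milliseconds:

```python
import random

def backoff_delay(retries: int,
                  base_ms: int = 2000,
                  factor: float = 2.0,
                  jitter: float = 0.20,
                  max_ms: int = 60000) -> int:
    """Exponential backoff with +/-20% jitter, per the policy above.
    retries=0 is the first retry."""
    delay = min(base_ms * (factor ** retries), max_ms)
    spread = delay * jitter  # jitter spreads retries so they don't sync up
    return int(delay + random.uniform(-spread, spread))
```

Capping at maxDelay keeps a long outage from producing hour-long sleeps, while the jitter prevents every failed item from retrying in lockstep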
flowchart TD
A[Try HTTP] --> B{Content found}
B -->|Yes| C[PDF Download]
B -->|No| D[Browser Path]
D --> C
C --> E[Drive Upload]
E --> F[Slack Notify]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A,B process
class C,D,E,F action
Performance tip: cache stable pages, reuse sessions, and split in batches to cap QPS. QPS means queries per second
5. Avoid blocks and comply
What you’ll learn: Practical anti‑blocking tactics and governance for compliant scraping in n8n
You want durability, not drama. Blend technical tact with policy sense
5.1 Rotating proxies
| Type | Best for | Notes |
|---|---|---|
| None | Public docs and low risk | Fast and cheapest |
| Datacenter | Mild defenses and bulk pages | Good speed, moderate cost |
| Residential | Strong defenses and commerce | Higher cost, higher trust |
- Rotate IPs per domain and per session
- Warm pools slowly to avoid spikes
- Choose least invasive option that still works
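A per-domain round-robin rotator is one simple way to implement these rules; this Python sketch assumes you supply your own proxy URLs (the ones in the test are placeholders):

```python
from itertools import cycle

class ProxyRotator:
    """Round-robin proxies per domain so one site's sessions don't
    burn through the whole pool at once."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._per_domain = {}  # domain -> its own rotation cursor

    def next_for(self, domain: str) -> str:
        """Return the next proxy URL for this domain."""
        if domain not in self._per_domain:
            self._per_domain[domain] = cycle(self._proxies)
        return next(self._per_domain[domain])
```

Per-domain cursors mean a burst against one target does not skew which proxies every other target sees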
5.2 Realistic headers and sessions
- User‑Agent: keep it current and plausible, mixing desktop and mobile
- Accept‑Language and Referer align with the user agent
- Cookies persist between runs and reuse valid sessions
Headers
User-Agent: Chrome/120 Windows
Accept-Language: en-US,en;q=0.9
Referer: https://portal.example.com/
Consistency beats randomness that looks fake
5.3 Rate limiting and backoff
- Batch requests with Split In Batches to cap QPS
- Insert Wait with random delays. Jitter spreads load
- Treat 429 as a signal, not a failure
IF (status == 429) -> Wait(5s * 2^retries + jitter) -> Retry (max 5)
Going slower often gets you there faster
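The Split In Batches + Wait pattern can be sketched in Python; batch size and the delay bounds are illustrative defaults:

```python
import random
import time

def paced_batches(items, batch_size=10, min_wait=1.0, max_wait=3.0):
    """Yield items in batches with a randomized pause between batches,
    mirroring Split In Batches followed by a Wait node with jitter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
        if start + batch_size < len(items):  # no pause after the last batch
            time.sleep(random.uniform(min_wait, max_wait))
```

Randomizing the pause matters: a fixed interval is itself a bot signature, while jittered pacing looks like organic traffic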
5.4 CAPTCHAs and hard blocks
- Do not brute‑force CAPTCHAs; that is a policy boundary
- Prefer logins or APIs provided by the site
- Stop and reassess if you hit walls repeatedly
Choose sustainability over short‑term wins
5.5 Compliance essentials
- robots.txt: check it and avoid disallowed paths
- Terms of Service: respect no‑scrape clauses when present
- Personal data: minimize, mask, and store securely
- Governance: audit logs, controlled credentials, approvals
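For the robots.txt check, Python's standard library already ships a parser; this sketch evaluates rules you have fetched yourself, and the agent name is a placeholder:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, url: str, agent: str = "my-n8n-bot") -> bool:
    """Check a URL against robots.txt rules before scraping.
    In production, fetch robots.txt from the target origin first."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

Gate each workflow run on this check so a site tightening its rules stops your scraper before the site has to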
Document purpose, scope, and legal basis before you run at scale. This guide complements a general web scraping primer by focusing on the n8n browser layer and on avoiding blocks
Final checklist: HTTP‑first, render only where needed, rotate politely, pace with backoff, cache aggressively, log everything, review robots.txt and ToS, protect credentials, archive outputs with lineage
erDiagram
Run ||--o{ ReportFile : has
Run ||--o{ Session : uses
Run {
int id
string status
datetime started_at
int duration_ms
}
ReportFile {
int id
int run_id
string name
int size
datetime created_at
}
Session {
int id
string domain
string cookies
datetime updated_at
}
Transition: You now have a clear path from lightweight HTTP to robust browser automation in n8n, with patterns that scale and stay compliant