
5 Building Blocks: Private AI with n8n, Ollama, MCP


💡

You want private, fast, and cheap. A single VPS with n8n + Ollama + MCP gets you there. Think Apollo 13: tight constraints forced smart engineering. Do the same: ship agents that respect data, budgets, and latency.

One‑VPS Architecture

What you’ll learn: How to run n8n, Ollama, and MCP on one VPS with clear boundaries, safe payloads, and portable storage

Start small, design for clarity, and keep binaries out of flows. VPS means virtual private server. n8n is a workflow orchestrator. Ollama runs local LLMs. MCP (Model Context Protocol) discovers tools at runtime.

Stack roles

  • n8n: orchestration and webhooks
  • Ollama: local LLMs for text, embeddings, and vision
  • MCP: dynamic tool discovery without hardcoding

Boundary rules

  • Control in n8n: n8n handles control flow and retries
  • Code behind HTTP: custom code runs behind clean HTTP endpoints

Payload policy

  • Strings, JSON, URLs only in flows
  • Files external in object storage with signed links

Reference flow

flowchart TD
    A[Client] -->|Webhook| B[n8n]
    B -->|HTTP| C[Ollama]
    B -->|HTTP| D[MCP Servers]
    B -->|Vector Query| E[Vector DB]
    B -->|SQL Query| F[SQL DB]
    B -->|Respond| A

    classDef trigger fill:#e1f5fe,stroke:#01579b
    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32

    class A trigger
    class B process
    class C,D,E,F action

Deployment steps

  1. Containerize everything with Docker
  2. Reverse proxy in front for TLS and routing
  3. State in Postgres and object storage so the box is disposable
# docker-compose.yml (sketch)
services:
  proxy:
    image: traefik:v3
    command: ["--providers.docker", "--entrypoints.web.address=:80", "--entrypoints.websecure.address=:443", "--certificatesresolvers.le.acme.tlschallenge=true"]
    ports: ["80:80", "443:443"]
    volumes: ["/var/run/docker.sock:/var/run/docker.sock:ro"]
  n8n:
    image: n8nio/n8n:latest
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_PASSWORD=secret
      - WEBHOOK_URL=https://automation.example.com/
    depends_on: [postgres]
  ollama:
    image: ollama/ollama:latest
    volumes: ["ollama:/root/.ollama"]
  postgres:
    image: postgres:16
    environment: ["POSTGRES_PASSWORD=secret"]
    volumes: ["pgdata:/var/lib/postgresql/data"]
volumes: { ollama: {}, pgdata: {} }

Sizing guide

  • MVP: 2 vCPU, 4–8 GB RAM, NVMe SSD, about $8–$18 per month. MVP means minimum viable product. NVMe is a fast SSD
  • Busy: 4 vCPU, 16 GB RAM, about $20–$40 per month
  • Vision heavy: CPU is acceptable but slower. Use a GPU only if p95 must be under 5 s. p95 means 95th percentile latency

Backups and retention

  • Postgres: nightly dump to object storage, keep 7–30 days
  • n8n executions: auto prune to avoid disk creep
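The retention rule above can be sketched as a small pruning helper. This assumes dumps are named with an ISO date, like `pg_2025-01-31.sql.gz`; the naming scheme is hypothetical, so adapt the parsing to whatever your backup job actually writes.

```python
from datetime import date, timedelta

def dumps_to_delete(filenames, today, keep_days=30):
    """Return dump files older than the retention window.

    Assumes dumps are named 'pg_YYYY-MM-DD.sql.gz' (hypothetical scheme).
    """
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in filenames:
        # Extract the ISO date between the 'pg_' prefix and the extension.
        day = date.fromisoformat(name[len("pg_"):len("pg_") + 10])
        if day < cutoff:
            stale.append(name)
    return stale
```

Run it nightly right after the dump uploads, and delete whatever it returns from object storage.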

Design for a future migration, but plan to live on one box comfortably for months. Next, make the work event driven.

erDiagram
    Execution ||--o{ ToolCall : has
    VectorDoc ||--o{ ToolCall : cited_by

    Execution {
        int id
        string workflow
        datetime created_at
        string status
    }

    ToolCall {
        int id
        int execution_id
        string tool
        string params
        datetime created_at
    }

    VectorDoc {
        int id
        string title
        string source
        string uri
    }

Webhook First

What you’ll learn: Why webhooks beat polling, how to ack fast, and how to keep handlers safe and small

Events beat crons. You pay only when work happens.

Why webhooks

  • Lower cost and noise with no empty polling
  • Lower latency with faster first byte
  • Simpler scaling by decoupling ingest from processing

Response pattern

  1. Receive the event on a webhook
  2. Acknowledge immediately with HTTP 202
  3. Enqueue a job to a worker or child workflow for the heavy work
{
  "status": 202,
  "message": "Accepted"
}

Security basics

  • Secret header or HMAC signature, verify before enqueue. HMAC is a keyed hash used to prove authenticity
  • IP allowlists for vendor webhooks
  • Rate limits per endpoint
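HMAC verification is a few lines with the standard library. A sketch, assuming the vendor sends a hex-encoded HMAC-SHA256 of the raw body; the header name and encoding vary by vendor, so check their docs:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Recompute HMAC-SHA256 of the body and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels on the comparison
    return hmac.compare_digest(expected, signature_hex)
```

Verify before you enqueue: a forged event should never reach a worker.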

Strings and URLs only

  • Accept URLs for images and files, then fetch from object storage on demand
  • Reject base64 blobs in webhooks to keep memory flat
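A sketch of that guard: accept https URLs and short strings, reject data-URI/base64 fields. The size cap is illustrative; tune it per endpoint.

```python
from urllib.parse import urlparse

MAX_INLINE_BYTES = 32 * 1024  # illustrative cap, not a standard

def accept_field(field: str) -> bool:
    """Accept short strings; reject inline base64 blobs."""
    if field.startswith("data:"):  # data URIs smuggle base64 into JSON
        return False
    return len(field.encode()) <= MAX_INLINE_BYTES

def is_fetchable_url(field: str) -> bool:
    """Only https URLs with a host qualify for later fetching."""
    parts = urlparse(field)
    return parts.scheme == "https" and bool(parts.netloc)
```

Anything that fails the guard gets a 4xx at the webhook, before it can bloat memory downstream.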

Trigger styles

  • Polling: higher latency and cost, hidden timeouts, use only for legacy APIs
  • Webhooks: lower latency and cost, backpressure is visible, use for most external events
flowchart TD
    V[Vendor] -->|Event| W[Webhook]
    W -->|Ack 202| R[Client]
    W -->|Enqueue| Q[Queue]
    Q --> S[Worker]
    S --> D[Done]

    classDef trigger fill:#e1f5fe,stroke:#01579b
    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32

    class V,R trigger
    class W,Q process
    class S,D action
💡

Tip: keep the webhook handler tiny and deterministic. Push work to a queue within 50 ms and log the job id for traceability

A fast ingest path sets you up for model calls and retrieval.

Ollama Models

What you’ll learn: How to connect n8n to Ollama, choose CPU friendly models, and build a precise RAG pipeline

RAG means retrieval augmented generation. LLM means large language model.

Connect n8n and Ollama

  • Ollama API at http://ollama:11434
  • n8n LLM node points to the Ollama host and sets a model per step for reasoning and embeddings

Model choices

| Task | Model | Note |
| --- | --- | --- |
| Reasoning or chat | Llama 3.1 or 3.2, 8B–13B, Q4 | Good instruction following on CPU. Q4 is quantized weights |
| Embeddings | mxbai-embed-large | Strong local semantic search |
| Vision | Llama 3.2 Vision | Works on CPU, slower but private |

Latency to expect

  1. First token in about 800–2000 ms on 4 vCPU
  2. Stream at about 5–20 tokens per second, prompt size dominates
  3. Vision about 10–45 s per image, resolution matters
# Quick smoke test
ollama pull llama3.1:8b
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Say hi in one line.","stream":false}'

RAG pipeline

  1. Ingest with a loader, clean, and chunk to 300–500 tokens with 10–20 overlap
  2. Embed with mxbai via Ollama, store vectors with metadata
  3. Query and build a hybrid prompt with top k chunks and citations. top k is the number of retrieved chunks
  4. Guardrails with a confidence threshold and an “I do not know” fallback
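Step 1's chunking is a sliding window. A sketch, using whitespace tokens as a stand-in for model tokens (an approximation; a real pipeline would count with the model's tokenizer):

```python
def chunk(tokens, size=400, overlap=15):
    """Split a token list into overlapping chunks of `size` tokens."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    # Each chunk starts `overlap` tokens before the previous one ended.
    return [tokens[i:i + size] for i in range(0, len(tokens), step)
            if tokens[i:i + size]]
```

The overlap keeps a sentence that straddles a boundary retrievable from either side.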
flowchart TD
    I[Ingest] --> C[Chunk]
    C --> E[Embed]
    E --> V[Vectors]
    Qy[Query] --> R[Retrieve]
    R --> P[Prompt]
    P --> G[Generate]
    G --> A[Answer]

    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32

    class I,C,E,Qy,R,P,G,A process
    class V action
[Prompt]
You answer using ONLY the provided context. If insufficient, say you don't know.
Context:
{{ $json.context }}
Question:
{{ $json.q }}
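Steps 3–4 combined: assemble the grounded prompt only when retrieval clears a confidence threshold, otherwise signal the fallback. The tuple shape and threshold below are illustrative, not a fixed API:

```python
def build_prompt(question, hits, min_score=0.6):
    """Assemble a grounded prompt; return None when retrieval is too weak.

    `hits` is a list of (score, title, text) tuples from the vector store.
    """
    strong = [(t, x) for s, t, x in hits if s >= min_score]
    if not strong:
        return None  # caller answers "I don't know" or escalates
    context = "\n".join(f"[{i + 1}] {t}: {x}"
                        for i, (t, x) in enumerate(strong))
    return ("You answer using ONLY the provided context. "
            "If insufficient, say you don't know.\n"
            f"Context:\n{context}\nQuestion:\n{question}")
```

The numbered `[1]`, `[2]` markers give the model something concrete to cite.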

Images without blobs

  • Pass a signed URL in, run vision on the fetched image, then store the conclusions and the link
  • OCR for text extraction, drop images after TTL. OCR is optical character recognition

Design RAG like a search product. Precision beats verbosity. Next, add tools and agents.

Agents with MCP

What you’ll learn: How to run a small agent with safe tools, when to route to specialists, and how MCP discovers tools at runtime

Single agent baseline

  1. System prompt defines mission, tone, and refusal policy
  2. Tools include HTTP, vector query, DB read, and small code for edge cases
  3. Memory keeps short chat history and stores long term facts in a vector store
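The short-history part of step 3 can be sketched as a window that always preserves the system prompt while trimming old turns (message shape follows the common role/content convention):

```python
def trim_history(messages, max_turns=6):
    """Keep the system prompt plus only the most recent turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```

Anything older than the window belongs in the vector store as a long-term fact, not in the prompt.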

Multi agent on one VPS

  • Router agent forwards to specialists for docs, billing, and ops; keep hops at or under 3

MCP in practice

  • List tools at runtime with no hardcoded specs
  • Add or remove capabilities by deploying or toggling MCP servers
[Agent System Prompt]
You MUST cite sources from vector hits and refuse answers without grounding.
Prefer tools with lower latency; avoid web unless asked.

Security and privacy

  • Read safe by default for exposed tools
  • Explicit approval for write or delete operations
  • Log every tool call with parameters and duration
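Those three rules fit in one wrapper: gate write tools behind explicit approval, and append every call to an audit log with its duration. The tool names here are hypothetical.

```python
import time

AUDIT_LOG = []
WRITE_TOOLS = {"db_write", "file_delete"}  # illustrative names

def call_tool(tool, func, params, approved=False):
    """Run a tool call with an approval gate and an audit record."""
    if tool in WRITE_TOOLS and not approved:
        raise PermissionError(f"{tool} requires explicit approval")
    start = time.monotonic()
    result = func(**params)
    AUDIT_LOG.append({"tool": tool, "params": params,
                      "ms": round((time.monotonic() - start) * 1000, 1)})
    return result
```

Because every call funnels through one function, read-safe defaults are enforced in one place instead of per agent.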

Tool patterns

| Pattern | Pros | Use |
| --- | --- | --- |
| Static tools | Predictable and simple | Few and stable tools |
| MCP tools | Pluggable and flexible | Many and changing tools |
flowchart TD
    R[Router Agent] --> D[Docs Agent]
    R --> B[Billing Agent]
    R --> O[Ops Agent]

    classDef trigger fill:#e1f5fe,stroke:#01579b
    class R trigger

A small, sharp toolbox beats a cluttered drawer. Now enforce performance boundaries.

Performance and Boundaries

What you’ll learn: How to measure workflows, choose execution modes, and decide when to write code vs no code

rps means requests per second.

Execution modes

  1. Single process is simple and fine under about 20 rps
  2. Queue mode gives fast webhook acks with workers for long jobs

Benchmark method

  • Track rps, p95 latency, and error rate per workflow
  • Separate ingest to ack from process to complete
  • Sample about 200–1000 requests per scenario
# k6 sketch
k6 run -e URL=https://.../webhook --vus 50 --duration 2m load.js
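p95 over the sampled latencies can be computed with the nearest-rank method, which needs no external stats library:

```python
import math

def p95(samples_ms):
    """95th-percentile latency via the nearest-rank method."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```

Compute it separately for ack latency and end-to-end processing, since the two distributions answer different questions.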

Bottlenecks and fixes

  • DB contention: move state to Postgres, add indexes, and prune executions
  • Prompt bloat: shorten context, raise top k quality, and cache
  • Payload weight: use URLs, not blobs, and gzip JSON

Code vs no code rules

  1. Use an n8n node when a stable node exists
  2. Write a small service behind HTTP when logic exceeds about 20 lines or is reused across flows
  3. Map data with Set and IF nodes for clear transforms
  4. Profile first and optimize only the hot path when performance is the issue
  5. Prefer no code when non developers must maintain it and document inputs

Task choices

| Task | n8n node | Custom code |
| --- | --- | --- |
| HTTP to vendor API | Yes | |
| Complex JSON transforms | | Yes |
| Vector search | Yes | |
| Proprietary scoring | | Yes |

Mini end to end

  1. Ticket system sends a webhook and you ack with 202
  2. Agent queries RAG, drafts a reply, and cites sources
  3. Low confidence escalates via API
  4. Store summary and links, not files
  5. Targets: ingest under 100 ms, reply in about 6–20 s on CPU, errors under 1 percent
flowchart TD
    T[Ticket Event] --> H[Webhook]
    H --> A[Ack 202]
    H --> J[Job Queue]
    J --> AG[Agent]
    AG --> RG[RAG]
    RG --> RE[Reply]
    AG -->|Low conf| ES[Escalate]
    RE --> ST[Store Summary]
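The low-confidence branch in step 3 is a one-line routing decision; the threshold below is illustrative and should be tuned against real escalation rates:

```python
def route_reply(confidence, draft, threshold=0.7):
    """Send the drafted reply when grounded, otherwise escalate to a human."""
    if confidence >= threshold:
        return {"action": "reply", "body": draft}
    return {"action": "escalate", "body": draft}  # e.g. via the ticket API
```

Keeping the draft attached on escalation means the human starts from the agent's work instead of from scratch.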

Rough cost per 1k tickets

  • Compute about $0.80–$1.80 on a CPU VPS
  • Storage and egress are low if you prune
  • People time near zero once stable

Keep forms light, prompts tight, and flows observable. Production follows naturally.

💡

Ship a thin vertical first: webhook to RAG answer to MCP powered tool call. Then iterate, add vision, switch to queue mode, and move any messy logic behind a tiny Rails API. Constraints will make your agents better.
