
5 Building Blocks: Private AI with n8n, Ollama, MCP


💡

You want private, fast, and cheap. A single VPS with n8n + Ollama + MCP gets you there. Think Apollo 13: tight constraints forced smart engineering. Do the same: ship agents that respect data, budgets, and latency.

One‑VPS Architecture

What you’ll learn: How to run n8n, Ollama, and MCP on one VPS with clear boundaries, safe payloads, and portable storage

Start small, design for clarity, and keep binaries out of flows. VPS means virtual private server. n8n is a workflow orchestrator. Ollama runs local LLMs. MCP (Model Context Protocol) discovers tools at runtime.

Stack roles

  • n8n: orchestration and webhooks
  • Ollama: local LLMs for text, embeddings, and vision
  • MCP: dynamic tool discovery without hardcoding

Boundary rules

  • Control in n8n: n8n handles control flow and retries
  • Code behind HTTP: custom code runs behind clean HTTP endpoints

Payload policy

  • Strings, JSON, URLs only in flows
  • Files external in object storage with signed links

Reference flow

flowchart TD
    A[Client] -->|Webhook| B[n8n]
    B -->|HTTP| C[Ollama]
    B -->|HTTP| D[MCP Servers]
    B -->|Vector Query| E[Vector DB]
    B -->|SQL Query| F[SQL DB]
    B -->|Respond| A

    classDef trigger fill:#e1f5fe,stroke:#01579b
    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32

    class A trigger
    class B process
    class C,D,E,F action

Deployment steps

  1. Containerize everything with Docker
  2. Reverse proxy in front for TLS and routing
  3. State in Postgres and object storage so the box is disposable
# docker-compose.yml (sketch)
services:
  proxy:
    image: traefik:v3
    command: ["--providers.docker", "--entrypoints.web.address=:80", "--entrypoints.websecure.address=:443", "--certificatesresolvers.le.acme.tlschallenge=true"]
    ports: ["80:80", "443:443"]
    volumes: ["/var/run/docker.sock:/var/run/docker.sock:ro"]
  n8n:
    image: n8nio/n8n:latest
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_PASSWORD=secret
      - WEBHOOK_URL=https://automation.example.com/
    depends_on: [postgres]
  ollama:
    image: ollama/ollama:latest
    volumes: ["ollama:/root/.ollama"]
  postgres:
    image: postgres:16
    environment: ["POSTGRES_PASSWORD=secret"]
    volumes: ["pgdata:/var/lib/postgresql/data"]
volumes: { ollama: {}, pgdata: {} }

Sizing guide

  • MVP: 2 vCPU, 4–8 GB RAM, NVMe SSD, about $8–$18 per month. MVP means minimum viable product. NVMe is a fast SSD
  • Busy: 4 vCPU, 16 GB RAM, about $20–$40 per month
  • Vision heavy: CPU is acceptable but slower. Use a GPU only if p95 must be under 5 s. p95 means 95th percentile latency

Backups and retention

  • Postgres: nightly dump to object storage, keep 7–30 days
  • n8n executions: auto prune to avoid disk creep
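The retention rule above can be sketched as a small pruning helper. This assumes dumps are named with an ISO date, like `pg_2025-01-31.sql.gz`; the naming scheme is hypothetical, so adapt the parsing to whatever your backup job actually writes.

```python
from datetime import date, timedelta

def dumps_to_delete(filenames, today, keep_days=30):
    """Return dump files older than the retention window.

    Assumes dumps are named 'pg_YYYY-MM-DD.sql.gz' (hypothetical scheme).
    """
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in filenames:
        # Extract the ISO date between the 'pg_' prefix and the extension.
        day = date.fromisoformat(name[len("pg_"):len("pg_") + 10])
        if day < cutoff:
            stale.append(name)
    return stale
```

Run it nightly right after the dump uploads, and delete whatever it returns from object storage.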

Design for a future migration, but plan to live on one box comfortably for months. Next, make the work event driven.

erDiagram
    Execution ||--o{ ToolCall : has
    VectorDoc ||--o{ ToolCall : cited_by

    Execution {
        int id
        string workflow
        datetime created_at
        string status
    }

    ToolCall {
        int id
        int execution_id
        string tool
        string params
        datetime created_at
    }

    VectorDoc {
        int id
        string title
        string source
        string uri
    }

Webhook First

What you’ll learn: Why webhooks beat polling, how to ack fast, and how to keep handlers safe and small

Events beat crons. You pay only when work happens.

Why webhooks

  • Lower cost and noise with no empty polling
  • Lower latency with faster first byte
  • Simpler scaling by decoupling ingest from processing

Response pattern

  1. Receive the event on a webhook
  2. Acknowledge immediately with HTTP 202
  3. Enqueue a job to a worker or child workflow for the heavy work
{
  "status": 202,
  "message": "Accepted"
}

Security basics

  • Secret header or HMAC signature, verify before enqueue. HMAC is a keyed hash used to prove authenticity
  • IP allowlists for vendor webhooks
  • Rate limits per endpoint
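HMAC verification is a few lines with the standard library. A sketch, assuming the vendor sends a hex-encoded HMAC-SHA256 of the raw body; the header name and encoding vary by vendor, so check their docs:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Recompute HMAC-SHA256 of the body and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels on the comparison
    return hmac.compare_digest(expected, signature_hex)
```

Verify before you enqueue: a forged event should never reach a worker.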

Strings and URLs only

  • Accept URLs for images and files, then fetch from object storage on demand
  • Reject base64 blobs in webhooks to keep memory flat
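A sketch of that guard: accept https URLs and short strings, reject data-URI/base64 fields. The size cap is illustrative; tune it per endpoint.

```python
from urllib.parse import urlparse

MAX_INLINE_BYTES = 32 * 1024  # illustrative cap, not a standard

def accept_field(field: str) -> bool:
    """Accept short strings; reject inline base64 blobs."""
    if field.startswith("data:"):  # data URIs smuggle base64 into JSON
        return False
    return len(field.encode()) <= MAX_INLINE_BYTES

def is_fetchable_url(field: str) -> bool:
    """Only https URLs with a host qualify for later fetching."""
    parts = urlparse(field)
    return parts.scheme == "https" and bool(parts.netloc)
```

Anything that fails the guard gets a 4xx at the webhook, before it can bloat memory downstream.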

Trigger styles

  • Polling: higher latency and cost, hidden timeouts, use only for legacy APIs
  • Webhooks: lower latency and cost, backpressure is visible, use for most external events
flowchart TD
    V[Vendor] -->|Event| W[Webhook]
    W -->|Ack 202| R[Client]
    W -->|Enqueue| Q[Queue]
    Q --> S[Worker]
    S --> D[Done]

    classDef trigger fill:#e1f5fe,stroke:#01579b
    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32

    class V,R trigger
    class W,Q process
    class S,D action
💡

Tip: keep the webhook handler tiny and deterministic. Push work to a queue within 50 ms and log the job id for traceability

A fast ingest path sets you up for model calls and retrieval.

Ollama Models

What you’ll learn: How to connect n8n to Ollama, choose CPU friendly models, and build a precise RAG pipeline

RAG means retrieval augmented generation. LLM means large language model.

Connect n8n and Ollama

  • Ollama API at http://ollama:11434
  • n8n LLM node points to the Ollama host and sets a model per step for reasoning and embeddings

Model choices

| Task | Model | Note |
| --- | --- | --- |
| Reasoning or chat | Llama 3.1 or 3.2, 8B–13B, Q4 | Good instruction following on CPU. Q4 is quantized weights |
| Embeddings | mxbai-embed-large | Strong local semantic search |
| Vision | Llama 3.2 Vision | Works on CPU, slower but private |

Latency to expect

  1. First token in about 800–2000 ms on 4 vCPU
  2. Stream at about 5–20 tokens per second, prompt size dominates
  3. Vision about 10–45 s per image, resolution matters
# Quick smoke test
ollama pull llama3.1:8b
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"Say hi in one line.","stream":false}'

RAG pipeline

  1. Ingest with a loader, clean, and chunk to 300–500 tokens with 10–20 overlap
  2. Embed with mxbai via Ollama, store vectors with metadata
  3. Query and build a hybrid prompt with top k chunks and citations. top k is the number of retrieved chunks
  4. Guardrails with a confidence threshold and an “I do not know” fallback
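Step 1's chunking is a sliding window. A sketch, using whitespace tokens as a stand-in for model tokens (an approximation; a real pipeline would count with the model's tokenizer):

```python
def chunk(tokens, size=400, overlap=15):
    """Split a token list into overlapping chunks of `size` tokens."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    # Each chunk starts `overlap` tokens before the previous one ended.
    return [tokens[i:i + size] for i in range(0, len(tokens), step)
            if tokens[i:i + size]]
```

The overlap keeps a sentence that straddles a boundary retrievable from either side.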
flowchart TD
    I[Ingest] --> C[Chunk]
    C --> E[Embed]
    E --> V[Vectors]
    Qy[Query] --> R[Retrieve]
    R --> P[Prompt]
    P --> G[Generate]
    G --> A[Answer]

    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32

    class I,C,E,Qy,R,P,G,A process
    class V action
[Prompt]
You answer using ONLY the provided context. If insufficient, say you don't know.
Context:
{{ $json.context }}
Question:
{{ $json.q }}
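Steps 3–4 combined: assemble the grounded prompt only when retrieval clears a confidence threshold, otherwise signal the fallback. The tuple shape and threshold below are illustrative, not a fixed API:

```python
def build_prompt(question, hits, min_score=0.6):
    """Assemble a grounded prompt; return None when retrieval is too weak.

    `hits` is a list of (score, title, text) tuples from the vector store.
    """
    strong = [(t, x) for s, t, x in hits if s >= min_score]
    if not strong:
        return None  # caller answers "I don't know" or escalates
    context = "\n".join(f"[{i + 1}] {t}: {x}"
                        for i, (t, x) in enumerate(strong))
    return ("You answer using ONLY the provided context. "
            "If insufficient, say you don't know.\n"
            f"Context:\n{context}\nQuestion:\n{question}")
```

The numbered `[1]`, `[2]` markers give the model something concrete to cite.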

Images without blobs

  • Pass a signed URL in, run vision on the fetched image, then store the conclusions and the link
  • OCR for text extraction, drop images after TTL. OCR is optical character recognition

Design RAG like a search product. Precision beats verbosity. Next, add tools and agents.

Agents with MCP

What you’ll learn: How to run a small agent with safe tools, when to route to specialists, and how MCP discovers tools at runtime

Single agent baseline

  1. System prompt defines mission, tone, and refusal policy
  2. Tools include HTTP, vector query, DB read, and small code for edge cases
  3. Memory keeps short chat history and stores long term facts in a vector store
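The short-history part of step 3 can be sketched as a window that always preserves the system prompt while trimming old turns (message shape follows the common role/content convention):

```python
def trim_history(messages, max_turns=6):
    """Keep the system prompt plus only the most recent turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```

Anything older than the window belongs in the vector store as a long-term fact, not in the prompt.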

Multi agent on one VPS

  • Router agent forwards to specialists for docs, billing, and ops; keep hops at or under 3

MCP in practice

  • List tools at runtime with no hardcoded specs
  • Add or remove capabilities by deploying or toggling MCP servers
[Agent System Prompt]
You MUST cite sources from vector hits and refuse answers without grounding.
Prefer tools with lower latency; avoid web unless asked.

Security and privacy

  • Read safe by default for exposed tools
  • Explicit approval for write or delete operations
  • Log every tool call with parameters and duration
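Those three rules fit in one wrapper: gate write tools behind explicit approval, and append every call to an audit log with its duration. The tool names here are hypothetical.

```python
import time

AUDIT_LOG = []
WRITE_TOOLS = {"db_write", "file_delete"}  # illustrative names

def call_tool(tool, func, params, approved=False):
    """Run a tool call with an approval gate and an audit record."""
    if tool in WRITE_TOOLS and not approved:
        raise PermissionError(f"{tool} requires explicit approval")
    start = time.monotonic()
    result = func(**params)
    AUDIT_LOG.append({"tool": tool, "params": params,
                      "ms": round((time.monotonic() - start) * 1000, 1)})
    return result
```

Because every call funnels through one function, read-safe defaults are enforced in one place instead of per agent.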

Tool patterns

| Pattern | Pros | Use |
| --- | --- | --- |
| Static tools | Predictable and simple | Few and stable tools |
| MCP tools | Pluggable and flexible | Many and changing tools |
flowchart TD
    R[Router Agent] --> D[Docs Agent]
    R --> B[Billing Agent]
    R --> O[Ops Agent]

    classDef trigger fill:#e1f5fe,stroke:#01579b
    class R trigger

A small, sharp toolbox beats a cluttered drawer. Now enforce performance boundaries.

Performance and Boundaries

What you’ll learn: How to measure workflows, choose execution modes, and decide when to write code vs no code

rps means requests per second.

Execution modes

  1. Single process is simple and fine under about 20 rps
  2. Queue mode gives fast webhook acks with workers for long jobs

Benchmark method

  • Track rps, p95 latency, and error rate per workflow
  • Separate ingest to ack from process to complete
  • Sample about 200–1000 requests per scenario
# k6 sketch
k6 run -e URL=https://.../webhook --vus 50 --duration 2m load.js
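p95 over the sampled latencies can be computed with the nearest-rank method, which needs no external stats library:

```python
import math

def p95(samples_ms):
    """95th-percentile latency via the nearest-rank method."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```

Compute it separately for ack latency and end-to-end processing, since the two distributions answer different questions.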

Bottlenecks and fixes

  • DB contention: move state to Postgres, add indexes, and prune executions
  • Prompt bloat: shorten context, raise top k quality, and cache
  • Payload weight: use URLs, not blobs, and gzip JSON

Code vs no code rules

  1. Use an n8n node when a stable node exists
  2. Write a small service behind HTTP when logic exceeds about 20 lines or is reused across flows
  3. Map data with Set and IF nodes for clear transforms
  4. Profile first and optimize only the hot path when performance is the issue
  5. Prefer no code when non developers must maintain it and document inputs

Task choices

| Task | n8n node | Custom code |
| --- | --- | --- |
| HTTP to vendor API | Yes | |
| Complex JSON transforms | | Yes |
| Vector search | Yes | |
| Proprietary scoring | | Yes |

Mini end to end

  1. Ticket system sends a webhook and you ack with 202
  2. Agent queries RAG, drafts a reply, and cites sources
  3. Low confidence escalates via API
  4. Store summary and links, not files
  5. Targets: ingest under 100 ms, reply in about 6–20 s on CPU, errors under 1 percent
flowchart TD
    T[Ticket Event] --> H[Webhook]
    H --> A[Ack 202]
    H --> J[Job Queue]
    J --> AG[Agent]
    AG --> RG[RAG]
    RG --> RE[Reply]
    AG -->|Low conf| ES[Escalate]
    RE --> ST[Store Summary]
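The low-confidence branch in step 3 is a one-line routing decision; the threshold below is illustrative and should be tuned against real escalation rates:

```python
def route_reply(confidence, draft, threshold=0.7):
    """Send the drafted reply when grounded, otherwise escalate to a human."""
    if confidence >= threshold:
        return {"action": "reply", "body": draft}
    return {"action": "escalate", "body": draft}  # e.g. via the ticket API
```

Keeping the draft attached on escalation means the human starts from the agent's work instead of from scratch.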

Rough cost per 1k tickets

  • Compute about $0.80–$1.80 on a CPU VPS
  • Storage and egress are low if you prune
  • People time near zero once stable

Keep forms light, prompts tight, and flows observable. Production follows naturally.

💡

Ship a thin vertical first: webhook to RAG answer to MCP powered tool call. Then iterate, add vision, switch to queue mode, and move any messy logic behind a tiny Rails API. Constraints will make your agents better.
