
Private AI for Healthcare on a Budget


Private AI for Healthcare

What you’ll learn:

  • How to ship private AI without sending PHI to public LLM APIs
  • The local stack: n8n, Ollama, Rails MCP, and vector DB
  • How to keep workflows auditable and HIPAA-aware

Healthcare needs AI that respects PHI (protected health information), runs fast, and stays affordable. You can ship that today with local, auditable workflows.

💡

In 1854, Florence Nightingale turned wartime chaos into clarity using simple counts and clean process. When budgets are tight and stakes are high, rigor beats flash. Private AI in healthcare follows the same playbook: keep data close, flows simple, and controls explicit.

  • Goal: ship useful AI without sending PHI to public LLM APIs
  • Stack: n8n for orchestration, Ollama for local models, Rails MCP for tools, vector DB for retrieval
  • Outcome: auditable, HIPAA-aware workflows that stay inside your network

Skip theory. Build small, safe loops that deliver value in days, not quarters.

Next, see the stack and how data flows through clear trust boundaries.


Stack and Architecture

What you’ll learn:

  • Why this local stack protects PHI and reduces latency
  • How the components connect in a simple flow
  • A minimal docker compose for local development

Why this stack

  • n8n: visual workflows, triggers, retries, secrets, role-based access control (RBAC), and audit logs
  • Ollama: local LLMs with quantization and a simple HTTP API
  • Rails MCP: an opinionated web stack plus least-privilege tool access via the Model Context Protocol (MCP)
  • Vector DB: semantic search for offline retrieval-augmented generation (RAG) using Qdrant or Weaviate

Compare options

| Decision | Cloud LLM APIs | Local Stack |
| --- | --- | --- |
| PHI handling | Data leaves the VPC (virtual private cloud) | PHI stays on premises |
| Latency | Network round trip | LAN speed |
| Cost at scale | Per-token fees | Fixed capital spend with low operating spend |
| Auditability | Vendor black box | Full local logs |
| Lock-in risk | High | Low |

Takeaway: choose local when privacy, predictability, and control matter more than convenience.

Architecture flow

flowchart TD
    U1[Clinician] --> N[n8n]
    U2[Patient] --> N
    N --> V[Vector DB]
    N --> R[Rails MCP]
    R --> T1[EHR Read]
    R --> T2[Policy Lookup]
    R --> T3[Calc]
    N --> O[Ollama]
    O --> L[Audit Logs]

    classDef trigger fill:#e1f5fe,stroke:#01579b
    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32
    classDef alert fill:#f3e5f5,stroke:#7b1fa2

    class U1,U2 trigger
    class N process
    class V,R,O,L action
    class T1,T2,T3 action

Minimal docker compose (dev)

version: "3.9"
services:
  n8n:
    image: n8nio/n8n:latest
    environment:
      - N8N_SECURE_COOKIE=true
      - N8N_USER_MANAGEMENT_DISABLED=false
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY}
    ports: ["5678:5678"]
    depends_on: [qdrant, ollama, rails]

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    ports: ["11434:11434"]
    command: ["serve"]

  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
    volumes:
      - qdrant:/qdrant/storage

  rails:
    build: ./rails_mcp
    environment:
      - RAILS_ENV=production
      - SECRET_KEY_BASE=${SECRET_KEY_BASE}
    ports: ["3000:3000"]

volumes:
  ollama:
  qdrant:

Network and secrets

  • Use private subnets and block egress on Ollama except updates
  • Terminate TLS at an internal proxy and require mutual TLS (mTLS) between services
  • Store secrets in a vault, not env files

Keep each service in its own security group to enforce least-privilege network paths.

Next, apply PHI safety and HIPAA controls to every flow.


PHI Safety and HIPAA

What you’ll learn:

  • What counts as PHI and where it is allowed
  • How to enforce trust boundaries and redaction
  • How to use encryption, RBAC, and audits

PHI scope and placement

  • PHI: any data that links an identifiable individual to their health status, care, or payment
  • Allowed: EHR, secure queues, encrypted n8n execution context, protected Rails database
  • Not allowed: vendor telemetry, error trackers without business associate agreements (BAAs), public LLM APIs

Default to deny. Then open only the narrow paths you can defend in an audit.

Trust boundaries and redaction

  1. Ingest, then redact, then enrich, then answer
  2. Strip identifiers early, such as name, date of birth, medical record number, address, phone, and exact dates
  3. Use scoped context windows and pass only the slices needed for the task
// n8n Code node (JavaScript, Run Once for Each Item)
// Naive redaction patterns for illustration; extend per locale and validate with tests
const phi = $json;
const scrub = (s) => s
  .replace(/[A-Z][a-z]+\s[A-Z][a-z]+/g, "[name]")   // first + last name
  .replace(/\b\d{2}\/\d{2}\/\d{4}\b/g, "[date]")     // MM/DD/YYYY dates
  .replace(/\b\d{10}\b/g, "[phone]");                // 10-digit phone numbers
// Return a redacted copy; the raw payload stays in the sealed store
return { json: { data: scrub(JSON.stringify(phi)) } };
  • Validate de-identification with unit tests and spot checks (see the test sketch below)
  • Maintain a PHI dictionary per locale to catch edge cases
  • Keep the raw payload in a sealed store and run workflows on redacted copies
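
A minimal test sketch for that validation, assuming the same naive scrub patterns as the Code node above and Node's built-in test runner; the sample strings are synthetic:

// deidentify.test.ts — spot checks for the redaction step (synthetic data only)
import test from "node:test";
import assert from "node:assert/strict";

// Mirrors the n8n Code node patterns; tighten per locale
const scrub = (s: string) => s
  .replace(/[A-Z][a-z]+\s[A-Z][a-z]+/g, "[name]")
  .replace(/\b\d{2}\/\d{2}\/\d{4}\b/g, "[date]")
  .replace(/\b\d{10}\b/g, "[phone]");

test("strips names, dates, and phone numbers", () => {
  const out = scrub("Jane Doe seen 03/14/2024, callback 5551234567");
  assert.ok(!out.includes("Jane Doe"));
  assert.ok(!out.includes("03/14/2024"));
  assert.ok(!out.includes("5551234567"));
});
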
flowchart TD
    A[Ingest] --> B[Redact]
    B --> C[Enrich]
    C --> D[Answer]
    D --> E[Log]

    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32

    class A,B,C,D,E process

Encryption, RBAC, and audits

  • Encryption: TLS 1.2+ in transit and AES-256 at rest; rotate keys on a schedule
  • RBAC: roles in n8n, scopes per MCP tool, and database row-level security in Rails
  • Audit: immutable centralized logs with minimal PHI including actor, action, and reason
{
  "ts": "2025-12-22T14:10:05Z",
  "actor": "svc:mcp-doc-summary",
  "patient_ref": "hash:5f2c…",
  "tool": "ehr.read_encounter",
  "purpose": "clinical_note_summarization",
  "workflow": "docs_summarize_v3",
  "result": "success"
}

Add policy checks in the flow: if the purpose is not a permitted use, stop the run and alert.
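
A minimal sketch of such a check as an n8n Code node; the permitted-purpose values and field name are assumptions to adapt to your policy register:

// n8n Code node: halt the run when the declared purpose is not a permitted use
// PERMITTED values are placeholders; source them from your policy register
const PERMITTED = new Set([
  "clinical_note_summarization",
  "patient_education",
  "triage_support",
]);

if (!PERMITTED.has($json.purpose)) {
  // Throwing fails the execution; pair with an n8n error workflow that alerts on-call
  throw new Error(`Purpose "${$json.purpose}" is not a permitted use`);
}
return { json: $json };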

MCP least privilege example

# rails_mcp/config/tools.yml
- name: ehr.read_encounter
  method: GET
  path: /fhir/Encounter/{id}
  scopes: ["read:encounter"]
  pii_return: minimal
- name: policy.lookup
  method: GET
  path: /policies/{section}
  scopes: ["read:policy"]
  pii_return: none
  • Each tool declares scopes and PHI exposure class
  • The MCP server enforces purpose binding and scope checks (sketched below)
  • Tools with write access require human approval gates in n8n
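
The check itself can stay small. A TypeScript sketch of what that enforcement might look like; the tool shape mirrors tools.yml above, and the function and field names are assumptions:

// Sketch: scope and purpose checks run before any tool call is executed
interface ToolDef {
  name: string;
  scopes: string[];
  pii_return: "none" | "minimal" | "full";
}

function authorizeCall(
  tool: ToolDef,
  grantedScopes: string[],
  purpose: string,
  permittedPurposes: string[],
): void {
  const scopesOk = tool.scopes.every((s) => grantedScopes.includes(s));
  const purposeOk = permittedPurposes.includes(purpose);
  if (!scopesOk || !purposeOk) {
    // Deny by default and leave a reason the audit log can carry
    throw new Error(`Denied ${tool.name}: scopesOk=${scopesOk}, purposeOk=${purposeOk}`);
  }
}
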
💡

Practice least privilege end to end: scope tools, redact inputs, cap outputs, and log the reason for every access.

With safeguards in place, you can ground answers using offline retrieval that stays on premises.


Offline RAG Flows

What you’ll learn:

  • How to choose and run a vector database locally
  • How to ingest and version documents safely
  • Prompt patterns that reduce hallucinations

Vector choices and ingestion

| Need | Qdrant | Weaviate |
| --- | --- | --- |
| Simple ops | Yes | Yes |
| CPU-friendly HNSW | Yes | Yes |
| Hybrid search | No | Yes |
| Snapshots | Yes | Yes |

  1. Split documents into 512 to 1,024 token chunks, embed them, and upsert with metadata
  2. Store de-identified knowledge separately from patient data
  3. Version collections and never overwrite clinical policies in place
# Create embeddings with Ollama locally
curl http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"Adult asthma guideline v2024 section 3.1"}'
  • Tag embeddings with source, specialty, version, and jurisdiction (see the upsert sketch below)
  • Keep a golden set for evaluation and reject drifty updates
  • Snapshot the index before large ingest jobs
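
A minimal ingestion sketch against local Ollama and Qdrant; the collection name, payload fields, and model are assumptions, and the Qdrant collection must already exist with a matching vector size:

// ingest.ts — embed one de-identified chunk with Ollama, then upsert it into Qdrant with metadata
async function ingestChunk(id: number, text: string): Promise<void> {
  // Ollama embeddings API (local); responds with { embedding: number[] }
  const embRes = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const { embedding } = await embRes.json();

  // Qdrant REST API: upsert a point with payload metadata used later for filtering
  await fetch("http://localhost:6333/collections/guidelines_2024/points", {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      points: [{
        id,
        vector: embedding,
        payload: { source: "asthma_guideline", specialty: "pulmonology", version: "2024", jurisdiction: "US" },
      }],
    }),
  });
}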

Prompt patterns that behave

SYSTEM: You are a cautious clinical assistant. Cite internal docs.
RULES: If unsure, say so. Don’t create facts. Follow {policy}.
CONTEXT: {{top_k_passages}}
QUESTION: {{user_question}}
OUTPUT: bullet points; flag red flags; link to policy ids.
  • Always pass policy snippets next to guidelines
  • Ask for uncertainty to reduce hallucinations
  • Cap output length to protect latency budgets (see the sketch below)
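
A sketch of filling that template and capping the response through Ollama's chat endpoint; the model name, token cap, and variable names are assumptions:

// answer.ts — assemble the prompt pattern and cap generated tokens
async function answer(question: string, passages: string[], policy: string): Promise<string> {
  const system = `You are a cautious clinical assistant. Cite internal docs. If unsure, say so. Follow ${policy}.`;
  const user = `CONTEXT:\n${passages.join("\n---\n")}\n\nQUESTION: ${question}`;

  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    body: JSON.stringify({
      model: "mistral:instruct",
      stream: false,
      options: { num_predict: 300 },   // hard cap on output tokens to protect latency
      messages: [
        { role: "system", content: system },
        { role: "user", content: user },
      ],
    }),
  });
  const data = await res.json();
  return data.message.content;
}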

Three example flows

1. Symptom triage

  • Trigger: inbound WhatsApp message
  • Steps:
    1. Redact identifiers
    2. Retrieve guideline passages
    3. Generate advice
    4. Route to RN review queue for sign off
  • Guardrails: route messages with urgent keywords to a call-911 banner (sketched after the diagram) and require RN approval before sending
flowchart TD
    I[Message In] --> R1[Redact]
    R1 --> Q1[Retrieve]
    Q1 --> G1[Generate]
    G1 --> H1[RN Review]
    H1 --> P1[Send]

    classDef process fill:#fff3e0,stroke:#ef6c00
    class I,R1,Q1,G1,H1,P1 process
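
The urgent-keyword guardrail can run as a Code node between Redact and Retrieve. A minimal sketch; the keyword list is illustrative, not a clinical escalation standard:

// n8n Code node: flag red-flag phrases before any model call
// Keyword list is a placeholder; use your clinical escalation protocol
const RED_FLAGS = ["chest pain", "can't breathe", "suicidal", "overdose", "stroke"];

const text = String($json.data || "").toLowerCase();
const urgent = RED_FLAGS.some((k) => text.includes(k));

// A downstream IF node routes urgent=true to the call-911 banner and RN escalation path
return { json: { data: $json.data, urgent } };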

2. Clinician note summarization

  • Trigger: encounter closed in EHR
  • Steps:
    1. MCP pulls meds and allergies
    2. Construct context with recent visits
    3. Produce SOAP style summary
    4. Draft to EHR inbox for approval
  • Guardrails: clinician approval required and log all tool calls with purpose

3. Lab explanation

  • Trigger: new lab result
  • Steps:
    1. Compare to baseline values
    2. Retrieve condition explainer
    3. Create plain language summary
    4. Post to patient portal
  • Guardrails: hold critical values for physician review and throttle messages to avoid alert fatigue

Dry run each flow with synthetic PHI before touching production data.
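
A tiny generator of obviously fake messages is usually enough for those dry runs. A sketch, with made-up names and complaints that derive from no real patient:

// synthetic.ts — fabricate fake inbound messages to exercise redaction and routing
const NAMES = ["Alex Test", "Jordan Sample", "Casey Placeholder"];
const COMPLAINTS = ["persistent cough for two weeks", "mild rash on one forearm", "follow-up on asthma plan"];

function syntheticMessage(i: number) {
  return {
    from: "+15550000000",   // reserved-style test number
    text: `${NAMES[i % NAMES.length]} reports ${COMPLAINTS[i % COMPLAINTS.length]}, DOB 01/01/1900`,
  };
}

// Feed a small batch through the triage webhook and inspect the redacted output
for (let i = 0; i < 10; i++) console.log(syntheticMessage(i));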

Up next, size models and hardware to hit latency and cost targets.


Latency and Cost

What you’ll learn:

  • Practical latency budgets for common use cases
  • Model choices, quantization, and warm up
  • Hardware and cost patterns that scale

Practical budgets

| Use case | P95 target | Notes |
| --- | --- | --- |
| Triage reply | 1 to 3 s | Short prompts, cached policy, 7B models |
| Doc summary | 5 to 12 s | Async is fine; batch long notes overnight |
| Lab explainer | 2 to 5 s | Pre-warm the model and reuse embeddings |

  • Measure end to end, not just LLM time
  • Cache top K passages and tokenizer outputs
  • Abort and fall back if the time budget is exceeded (see the sketch below)
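
One way to enforce that end to end is an abort timer around the model call with a canned fallback. A sketch; the budget value, model, and fallback text are assumptions:

// budget.ts — abort the LLM call and fall back when the latency budget is blown
async function generateWithBudget(prompt: string, budgetMs = 3000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      signal: controller.signal,
      body: JSON.stringify({ model: "mistral:instruct", prompt, stream: false }),
    });
    const data = await res.json();
    return { text: data.response, degraded: false };
  } catch {
    // Timed out or failed: return a safe canned reply and queue the request for async handling
    return { text: "We received your message and a nurse will follow up shortly.", degraded: true };
  } finally {
    clearTimeout(timer);
  }
}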

Models and quantization

  • Start with 7B-class models such as Mistral or Llama using 4-bit quantization
  • Move to 13B only if accuracy gains justify the added latency
  • Keep a tiny policy-only model for instant classification
# Pull and run a local chat model
ollama pull mistral:instruct
curl http://localhost:11434/api/chat -d '{
  "model":"mistral:instruct",
  "messages":[{"role":"user","content":"Summarize: patient has persistent cough, no fever."}]
}'
  • Pre-warm models at shift start (see the sketch below)
  • Pin versions and update on a cadence with quick rollbacks
  • Profile tokens per second and optimize prompt size before buying hardware
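
Pre-warming is just a near-empty request with a long keep_alive so the weights stay resident. A sketch; the keep_alive duration is an assumption to tune to shift length:

// prewarm.ts — load the model into memory before the first clinician request of the shift
await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  body: JSON.stringify({
    model: "mistral:instruct",
    prompt: "",          // an empty prompt only loads the model
    keep_alive: "8h",    // keep weights resident for the whole shift
    stream: false,
  }),
});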

Hardware sizing

  • Small clinic: one 24- to 32-core CPU, 128 to 256 GB RAM, and one RTX 4090 or L40S
  • Mid org: two to three GPUs and a separate box for vector DB and logging
  • Storage: fast NVMe for indexes and model weights

Scale with parallelism and caching first; add hardware last.

Cost modeling and rollout

| Item | Cloud LLM APIs | Local stack |
| --- | --- | --- |
| Per 1M tokens (est.) | $5 to $30 | $0 after capital spend |
| Year 1 spend | Operating spend heavy | Capital plus light operating spend |
| Multi-year | Variable | Predictable |
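
A back-of-the-envelope break-even sketch; every number below is a placeholder to show the comparison's shape, not a quote:

// breakeven.ts — rough payback period for local hardware versus per-token API pricing
const tokensPerMonth = 100_000_000;   // assumed workload
const apiCostPer1M = 15;              // assumed blended $ per 1M tokens
const hardwareCapex = 12_000;         // assumed GPU server cost
const localOpexPerMonth = 300;        // assumed power and maintenance

const apiMonthly = (tokensPerMonth / 1_000_000) * apiCostPer1M;              // $1,500
const monthsToBreakEven = hardwareCapex / (apiMonthly - localOpexPerMonth);  // 10 months
console.log({ apiMonthly, monthsToBreakEven });
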
  1. Pick one workflow with visible value
  2. Ship a pilot to a small cohort and measure latency, safety, and satisfaction
  3. Add guardrails, alerts, and dashboards
  4. Expand to a second workflow and reuse blocks
  5. Formalize governance and treat the stack as a small platform

Keep exit ramps open with open formats, containerized services, and no proprietary lock-in.

💡

Roadmap: start with triage or doc summaries, enforce de-identification at the edge, pin a 7B model in Ollama, wire tools through Rails MCP with strict scopes, and orchestrate in n8n. Measure, tighten, then expand.
