Private AI for Healthcare
What you’ll learn:
- How to ship private AI without sending PHI to public LLM APIs
- The local stack: n8n, Ollama, Rails MCP, and vector DB
- How to keep workflows auditable and HIPAA aware
Healthcare needs AI that respects PHI (protected health information), runs fast, and stays affordable. You can ship that today with local, auditable workflows
In 1854, Florence Nightingale turned wartime chaos into clarity using simple counts and clean process. When budgets are tight and stakes are high, rigor beats flash. Private AI in healthcare follows the same playbook: keep data close, flows simple, and controls explicit
- Goal: ship useful AI without sending PHI to public LLM APIs
- Stack: n8n for orchestration, Ollama for local models, Rails MCP for tools, vector DB for retrieval
- Outcome: auditable, HIPAA aware workflows that stay inside your network
Skip theory. Build small, safe loops that deliver value in days, not quarters
Next, see the stack and how data flows through clear trust boundaries
Stack and Architecture
What you’ll learn:
- Why this local stack protects PHI and reduces latency
- How the components connect in a simple flow
- A minimal docker compose for local development
Why this stack
- n8n: visual workflows, triggers, retries, secrets, role based access control (RBAC), and audit
- Ollama: local LLMs with quantization and a simple HTTP API
- Rails MCP: opinionated web stack plus least privilege tool access via model context protocol (MCP)
- Vector DB: semantic search for offline retrieval augmented generation (RAG) using Qdrant or Weaviate
Compare options
| Decision | Cloud LLM APIs | Local Stack |
|---|---|---|
| PHI handling | Data leaves VPC (virtual private cloud) | PHI stays on premises |
| Latency | Network round trip | LAN speed |
| Cost at scale | Per token fees | Fixed capital spend with low operating spend |
| Auditability | Vendor black box | Full local logs |
| Lock in risk | High | Low |
Takeaway: choose local when privacy, predictability, and control matter more than convenience
Architecture flow
flowchart TD
U1[Clinician] --> N[n8n]
U2[Patient] --> N
N --> V[Vector DB]
N --> R[Rails MCP]
R --> T1[EHR Read]
R --> T2[Policy Lookup]
R --> T3[Calc]
N --> O[Ollama]
O --> L[Audit Logs]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
classDef alert fill:#f3e5f5,stroke:#7b1fa2
class U1,U2 trigger
class N process
class V,R,O,L action
class T1,T2,T3 action
Minimal docker compose (dev)
version: "3.9"
services:
n8n:
image: n8nio/n8n:latest
environment:
- N8N_SECURE_COOKIE=true
- N8N_USER_MANAGEMENT_DISABLED=false
- N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY}
ports: ["5678:5678"]
depends_on: [qdrant, ollama, rails]
ollama:
image: ollama/ollama:latest
volumes:
- ollama:/root/.ollama
ports: ["11434:11434"]
command: ["serve"]
qdrant:
image: qdrant/qdrant:latest
ports: ["6333:6333"]
volumes:
- qdrant:/qdrant/storage
rails:
build: ./rails_mcp
environment:
- RAILS_ENV=production
- SECRET_KEY_BASE=${SECRET_KEY_BASE}
ports: ["3000:3000"]
volumes:
ollama:
qdrant:
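Before building flows, confirm that each service answers on the internal network. A minimal smoke-test sketch in TypeScript (Node 18+), assuming the compose service names above; the Rails /up health path is an assumption about the custom rails_mcp app:
// smoke_check.ts - confirm the local services answer before wiring workflows
// Run from a container on the compose network; from the host, use localhost and the published ports.
const targets = [
  { name: "ollama", url: "http://ollama:11434/api/tags" },   // lists pulled models
  { name: "qdrant", url: "http://qdrant:6333/collections" }, // lists vector collections
  { name: "rails_mcp", url: "http://rails:3000/up" },        // assumed health route on the Rails app
];

async function smokeCheck(): Promise<void> {
  for (const t of targets) {
    try {
      const res = await fetch(t.url);
      console.log(`${t.name}: ${res.ok ? "ok" : `HTTP ${res.status}`}`);
    } catch (err) {
      console.log(`${t.name}: unreachable (${(err as Error).message})`);
    }
  }
}

smokeCheck();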
Network and secrets
- Use private subnets and block egress from Ollama except for model updates
- Terminate TLS at an internal proxy and require mutual TLS (mTLS) between services
- Store secrets in a vault, not env files (a fetch sketch follows below)
Keep each service in its own security group to enforce least privilege network paths
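As one way to pull secrets at runtime instead of baking them into env files, here is a minimal sketch against HashiCorp Vault's KV v2 HTTP API; the vault choice, address, and secret path are assumptions:
// fetch_secret.ts - read a secret from Vault's KV v2 HTTP API instead of an env file
// The vault address, token source, and secret path are assumptions for illustration.
const VAULT_ADDR = process.env.VAULT_ADDR ?? "http://vault:8200";
const VAULT_TOKEN = process.env.VAULT_TOKEN ?? "";

async function readSecret(path: string): Promise<Record<string, string>> {
  const res = await fetch(`${VAULT_ADDR}/v1/secret/data/${path}`, {
    headers: { "X-Vault-Token": VAULT_TOKEN },
  });
  if (!res.ok) throw new Error(`Vault returned HTTP ${res.status}`);
  const body = await res.json();
  return body.data.data; // KV v2 nests the key/value pairs under data.data
}

// Example: load the n8n encryption key at startup rather than shipping it in compose
readSecret("n8n").then((secrets) => console.log(Object.keys(secrets)));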
Next, apply PHI safety and HIPAA controls to every flow
PHI Safety and HIPAA
What you’ll learn:
- What counts as PHI and where it is allowed
- How to enforce trust boundaries and redaction
- How to use encryption, RBAC, and audits
PHI scope and placement
- PHI: any data that can be linked to an individual's health status, care, or payment
- Allowed: EHR, secure queues, encrypted n8n execution context, protected Rails database
- Not allowed: vendor telemetry, error trackers without business associate agreements (BAAs), public LLM APIs
Default to deny. Then open only the narrow paths you can defend in an audit
Trust boundaries and redaction
- Ingest, then redact, then enrich, then answer
- Strip identifiers early such as name, date of birth, medical record number, address, phone, exact dates
- Use scoped context windows and pass only the slices needed for the task
// n8n Code node (TypeScript): redact obvious identifiers before any model call
// Patterns are illustrative; extend them from the per-locale PHI dictionary noted below
const phi = $json;
const scrub = (s: string) => s
  .replace(/[A-Z][a-z]+\s[A-Z][a-z]+/g, "[name]")  // naive full-name pattern
  .replace(/\b\d{2}\/\d{2}\/\d{4}\b/g, "[date]")   // dates such as dates of birth
  .replace(/\b\d{10}\b/g, "[phone]");              // 10-digit phone numbers
return { json: { data: scrub(JSON.stringify(phi)) } };
- Validate de-identification with unit tests and spot checks (a test sketch follows this list)
- Maintain a PHI dictionary per locale to catch edge cases
- Keep the raw payload in a sealed store and run workflows on redacted copies
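A minimal unit-test sketch for the scrubber above, using Node's built-in test runner; all sample values are synthetic:
// scrub.test.ts - spot-check the redaction patterns with synthetic PHI
import { test } from "node:test";
import assert from "node:assert/strict";

const scrub = (s: string) => s
  .replace(/[A-Z][a-z]+\s[A-Z][a-z]+/g, "[name]")
  .replace(/\b\d{2}\/\d{2}\/\d{4}\b/g, "[date]")
  .replace(/\b\d{10}\b/g, "[phone]");

test("removes names, dates of birth, and phone numbers", () => {
  const out = scrub("Jane Doe, DOB 01/02/1984, call 5551234567 about the cough");
  assert.ok(!out.includes("Jane Doe"));
  assert.ok(!out.includes("01/02/1984"));
  assert.ok(!out.includes("5551234567"));
});

test("leaves clinical content intact", () => {
  assert.ok(scrub("persistent cough, no fever").includes("persistent cough"));
});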
flowchart TD
A[Ingest] --> B[Redact]
B --> C[Enrich]
C --> D[Answer]
D --> E[Log]
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A,B,C,D,E process
Encryption, RBAC, and audits
- Encryption: TLS 1.2 or higher in transit and AES-256 at rest; rotate keys on a schedule
- RBAC: roles in n8n, scopes per MCP tool, database row level security in Rails
- Audit: immutable centralized logs with minimal PHI including actor, action, and reason
{
  "ts": "2025-12-22T14:10:05Z",
  "actor": "svc:mcp-doc-summary",
  "patient_ref": "hash:5f2c…",
  "tool": "ehr.read_encounter",
  "purpose": "clinical_note_summarization",
  "workflow": "docs_summarize_v3",
  "result": "success"
}
Add policy checks in the flow: if the purpose is not a permitted use, stop the run and alert. A minimal check is sketched below
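The sketch assumes an n8n Code node running once per item; the permitted-use list and field names are illustrative, not a compliance determination:
// n8n Code node (TypeScript) - stop the run when the declared purpose is not a permitted use
// Purpose list and field names are assumptions for illustration.
const PERMITTED_PURPOSES = new Set([
  "treatment",
  "clinical_note_summarization",
  "patient_education",
]);

const purpose = $json.purpose as string | undefined;

if (!purpose || !PERMITTED_PURPOSES.has(purpose)) {
  // Throwing fails the execution; pair it with an n8n error workflow that alerts on-call staff
  throw new Error(`Blocked: purpose "${purpose ?? "missing"}" is not a permitted use`);
}

return { json: { ...$json, purpose_checked_at: new Date().toISOString() } };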
MCP least privilege example
# rails_mcp/config/tools.yml
- name: ehr.read_encounter
  method: GET
  path: /fhir/Encounter/{id}
  scopes: ["read:encounter"]
  pii_return: minimal
- name: policy.lookup
  method: GET
  path: /policies/{section}
  scopes: ["read:policy"]
  pii_return: none
- Each tool declares scopes and PHI exposure class
- The MCP server enforces purpose binding and scope checks (the rule is illustrated after this list)
- Tools with write access require human approval gates in n8n
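The enforcement lives in the Rails MCP server itself; as an illustration of the rule it applies, here is a TypeScript sketch of the subset check against the declarations above (the caller's grant format is an assumption):
// scope_check.ts - illustration of the scope check the MCP server applies per tool call
// Tool declarations mirror rails_mcp/config/tools.yml; the caller's grant format is an assumption.
type Tool = { name: string; scopes: string[]; pii_return: "none" | "minimal" | "full" };

const tools: Tool[] = [
  { name: "ehr.read_encounter", scopes: ["read:encounter"], pii_return: "minimal" },
  { name: "policy.lookup", scopes: ["read:policy"], pii_return: "none" },
];

function canCall(toolName: string, grantedScopes: string[]): boolean {
  const tool = tools.find((t) => t.name === toolName);
  if (!tool) return false; // unknown tools are denied by default
  return tool.scopes.every((scope) => grantedScopes.includes(scope));
}

// A workflow granted only policy access cannot read encounters
console.log(canCall("policy.lookup", ["read:policy"]));      // true
console.log(canCall("ehr.read_encounter", ["read:policy"])); // false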
Practice least privilege end to end: scope tools, redact inputs, cap outputs, and log the reason for every access
With safeguards in place, you can ground answers using offline retrieval that stays on premises
Offline RAG Flows
What you’ll learn:
- How to choose and run a vector database locally
- How to ingest and version documents safely
- Prompt patterns that reduce hallucinations
Vector choices and ingestion
| Need | Qdrant | Weaviate |
|---|---|---|
| Simple ops | Yes | Yes |
| CPU friendly HNSW | Yes | Yes |
| Hybrid search | No | Yes |
| Snapshots | Yes | Yes |
- Split documents into 512 to 1,024 token chunks, embed them, and upsert with metadata
- Store de identified knowledge separately from patient data
- Version collections and never overwrite clinical policies in place
# Create embeddings with Ollama locally
curl http://localhost:11434/api/embeddings \
-d '{"model":"nomic-embed-text","prompt":"Adult asthma guideline v2024 section 3.1"}'
- Tag embeddings with source, specialty, version, and jurisdiction (see the ingestion sketch after this list)
- Keep a golden set for evaluation and reject drifty updates
- Snapshot the index before large ingest jobs
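A minimal ingestion sketch in TypeScript, assuming the nomic-embed-text model above and a Qdrant collection named guidelines_v2024 created with a matching vector size (768); all names and payload values are placeholders:
// ingest.ts - embed one de-identified chunk with Ollama and upsert it into Qdrant with metadata
// Collection, model, and payload values are placeholders; chunking to 512-1,024 tokens happens upstream.
const OLLAMA = "http://localhost:11434";
const QDRANT = "http://localhost:6333";
const COLLECTION = "guidelines_v2024"; // versioned collection, never overwritten in place

async function embed(text: string): Promise<number[]> {
  const res = await fetch(`${OLLAMA}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const body = await res.json();
  return body.embedding;
}

async function upsertChunk(id: number, text: string): Promise<void> {
  const vector = await embed(text);
  await fetch(`${QDRANT}/collections/${COLLECTION}/points?wait=true`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      points: [{
        id,
        vector,
        payload: { source: "asthma_guideline", specialty: "pulmonology", version: "2024", jurisdiction: "US", text },
      }],
    }),
  });
}

upsertChunk(1, "Adult asthma guideline v2024 section 3.1 ...");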
Prompt patterns that behave
SYSTEM: You are a cautious clinical assistant. Cite internal docs.
RULES: If unsure, say so. Don’t create facts. Follow {policy}.
CONTEXT: {{top_k_passages}}
QUESTION: {{user_question}}
OUTPUT: bullet points; flag red flags; link to policy ids.
- Always pass policy snippets next to guidelines
- Ask for uncertainty to reduce hallucinations
- Cap output length to protect latency budgets
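A sketch of how the pattern maps to a local chat call, assuming the top-k passages were already retrieved upstream; num_predict caps the output length to protect the latency budget:
// answer.ts - assemble the prompt pattern and call the local model with a capped output length
// Retrieval of top-k passages is assumed to happen upstream (for example, a Qdrant search node).
const SYSTEM = "You are a cautious clinical assistant. Cite internal docs. If unsure, say so. Don't create facts.";

async function answer(passages: string[], question: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "mistral:instruct",
      stream: false,
      options: { num_predict: 300, temperature: 0.2 }, // cap tokens and keep answers conservative
      messages: [
        { role: "system", content: SYSTEM },
        {
          role: "user",
          content: `CONTEXT:\n${passages.join("\n---\n")}\n\nQUESTION: ${question}\n\nOUTPUT: bullet points; flag red flags; link to policy ids.`,
        },
      ],
    }),
  });
  const body = await res.json();
  return body.message.content;
}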
Three example flows
1. Symptom triage
- Trigger: inbound WhatsApp message
- Steps:
- Redact identifiers
- Retrieve guideline passages
- Generate advice
- Route to RN review queue for sign off
- Guardrails: escalate messages containing urgent keywords to an emergency (911) banner and require RN approval before send (see the sketch after the flow)
flowchart TD
I[Message In] --> R1[Redact]
R1 --> Q1[Retrieve]
Q1 --> G1[Generate]
G1 --> H1[RN Review]
H1 --> P1[Send]
classDef process fill:#fff3e0,stroke:#ef6c00
class I,R1,Q1,G1,H1,P1 process
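A sketch of the urgent-keyword guardrail as an n8n Code node; the keyword list is illustrative only and the real list belongs to clinical governance:
// n8n Code node (TypeScript) - route urgent messages to the emergency banner path
// The keyword list is illustrative; the real list is owned by clinical governance.
const URGENT = ["chest pain", "can't breathe", "cannot breathe", "suicidal", "overdose", "unconscious"];

const text = String($json.data ?? "").toLowerCase();
const isUrgent = URGENT.some((keyword) => text.includes(keyword));

return {
  json: {
    ...$json,
    route: isUrgent ? "emergency_banner" : "rn_review_queue", // a downstream IF node branches on this
  },
};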
2. Clinician note summarization
- Trigger: encounter closed in EHR
- Steps:
- MCP pulls meds and allergies
- Construct context with recent visits
- Produce SOAP style summary
- Draft to EHR inbox for approval
- Guardrails: require clinician approval and log all tool calls with purpose (a tool-call sketch follows)
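How the MCP tool call might look from an n8n Code node is sketched below; the /tools path, X-Purpose header, token variable, and response fields are assumptions about the rails_mcp service, not a documented interface:
// n8n Code node (TypeScript) - pull meds and allergies before building the summary context
// The /tools path, X-Purpose header, env variable, and response fields are assumptions.
const encounterId = $json.encounter_id as string;

const res = await fetch(`http://rails:3000/tools/ehr.read_encounter?id=${encounterId}`, {
  headers: {
    Authorization: `Bearer ${$env.MCP_SERVICE_TOKEN}`, // service token scoped to read:encounter only
    "X-Purpose": "clinical_note_summarization",        // logged and checked server side
  },
});
if (!res.ok) throw new Error(`MCP tool call failed: HTTP ${res.status}`);

const encounter = await res.json();
return { json: { medications: encounter.medications, allergies: encounter.allergies } };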
3. Lab explanation
- Trigger: new lab result
- Steps:
- Compare to baseline values
- Retrieve condition explainer
- Create plain language summary
- Post to patient portal
- Guardrails: hold critical values for physician review and throttle messages to avoid alert fatigue (a hold sketch follows)
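A sketch of the critical-value hold as an n8n Code node; the analytes and ranges are illustrative, and the reporting lab's reference and critical ranges are authoritative:
// n8n Code node (TypeScript) - hold critical lab values for physician review
// Analytes and ranges are illustrative; use the reporting lab's critical ranges in production.
const CRITICAL: Record<string, { low: number; high: number }> = {
  potassium: { low: 2.5, high: 6.5 }, // mmol/L
  glucose: { low: 40, high: 500 },    // mg/dL
};

const { analyte, value } = $json as { analyte: string; value: number };
const range = CRITICAL[analyte];
const isCritical = range ? value < range.low || value > range.high : false;

return {
  json: {
    ...$json,
    route: isCritical ? "physician_review_hold" : "patient_portal_explainer",
  },
};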
Dry run each flow with synthetic PHI before touching production data
Up next, size models and hardware to hit latency and cost targets
Latency and Cost
What you’ll learn:
- Practical latency budgets for common use cases
- Model choices, quantization, and warm up
- Hardware and cost patterns that scale
Practical budgets
| Use case | P95 target | Notes |
|---|---|---|
| Triage reply | 1 to 3 s | Short prompts, cached policy, 7B models |
| Doc summary | 5 to 12 s | Async is fine; batch long notes overnight |
| Lab explainer | 2 to 5 s | Pre warm model and reuse embeddings |
- Measure end to end, not just LLM time
- Cache top K passages and tokenizer outputs
- Abort and fall back if the time budget is exceeded, as sketched below
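A minimal abort-and-fall-back sketch using AbortController; the 3,000 ms budget and the canned fallback message are placeholders:
// budget.ts - enforce a latency budget around the model call and fall back when it is exceeded
// The budget value and fallback text are placeholders; measure the whole flow, not just this call.
async function withBudget(ms: number, question: string): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    const res = await fetch("http://localhost:11434/api/chat", {
      method: "POST",
      signal: controller.signal,
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "mistral:instruct",
        stream: false,
        messages: [{ role: "user", content: question }],
      }),
    });
    const body = await res.json();
    return body.message.content;
  } catch {
    // Budget exceeded or service down: return a safe canned reply instead of blocking the flow
    return "We could not generate an answer in time. A clinician will follow up.";
  } finally {
    clearTimeout(timer);
  }
}

// Triage replies target a 1 to 3 s P95, so budget about 3,000 ms end to end
withBudget(3000, "Summarize: patient has persistent cough, no fever.").then(console.log);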
Models and quantization
- Start with 7B class models such as Mistral or Llama using 4 bit quantization
- Move to 13B if accuracy gains justify added latency
- Keep a tiny policy only model for instant classification
# Pull and run a local chat model
ollama pull mistral:instruct
curl http://localhost:11434/api/chat -d '{
"model":"mistral:instruct",
"messages":[{"role":"user","content":"Summarize: patient has persistent cough, no fever."}]
}'
- Pre warm models at shift start (see the sketch after this list)
- Pin versions and update on a cadence with quick rollbacks
- Profile tokens per second and optimize prompt size before buying hardware
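A pre-warm sketch: an /api/generate request with no prompt loads the model, and keep_alive holds it in memory; the 8h duration is a placeholder for a shift length:
// prewarm.ts - load the model at shift start so the first request avoids a cold start
// keep_alive holds the model in memory; "8h" is a placeholder for the local shift length.
async function prewarm(model: string): Promise<void> {
  await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, keep_alive: "8h" }),
  });
  console.log(`${model} loaded and pinned for the shift`);
}

prewarm("mistral:instruct");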
Hardware sizing
- Small clinic: one 24 to 32 core CPU, 128 to 256 GB RAM, one RTX 4090 or L40S
- Mid org: two to three GPUs and a separate box for vector DB and logging
- Storage: fast NVMe for indexes and model weights
Scale with parallelism and caching first; add hardware last
Cost modeling and rollout
| Item | Cloud LLM APIs | Local stack |
|---|---|---|
| Per 1M tokens (est) | $5 to $30 | Near $0 marginal cost after capital spend |
| Year 1 spend | Operating spend heavy | Capital plus light operating spend |
| Multi year | Variable | Predictable |
- Pick one workflow with visible value
- Ship a pilot to a small cohort and measure latency, safety, and satisfaction
- Add guardrails, alerts, and dashboards
- Expand to a second workflow and reuse blocks
- Formalize governance and treat the stack as a small platform
Keep exit ramps open with open formats, containerized services, and no proprietary lock-in
Roadmap: start with triage or doc summaries, enforce de identification at the edge, pin a 7B model in Ollama, wire tools through Rails MCP with strict scopes, and orchestrate in n8n. Measure, tighten, then expand