Shopify BigQuery Savings
What you’ll learn: How bytes scanned drive cost, the before/after impact, and the core idea behind the fix
A single unoptimized query can burn cash. Shopify saw each run scan ~75 GB, which at scale projected to nearly $1M per month.
- 60 requests/min × 60 × 24 × 30 ≈ 2.59M queries/month
- 75 GB scanned/query on on‑demand pricing turned into massive spend
- Clustering dropped each scan to ~508 MB and the bill to about $1.4K/month
In short, bytes scanned are the budget; prune the bytes, save the budget
You pay for the data queries touch, not for queries themselves
Before (approx)
- Bytes/query: 75 GB
- Monthly queries: 2,592,000
- On‑demand rate: billed per TB of data processed
- Projected monthly cost: just under $1,000,000
After (approx)
- Bytes/query: ~508 MB
- Monthly queries: same volume (~2.59M)
- Projected monthly cost: ≈ $1,400
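A quick sanity check on the before figure, assuming the legacy ~$5 per TB on‑demand rate in effect at the time (the exact rate and free tier shift the number slightly):
-- Back-of-envelope check of the "before" spend (the $5/TB rate is an assumption, not the exact bill)
SELECT
  60 * 60 * 24 * 30       AS queries_per_month,     -- 2,592,000
  2592000 * 75 / 1000     AS tb_scanned_per_month,  -- 194,400 TB
  2592000 * 75 / 1000 * 5 AS approx_usd_per_month;  -- ≈ 972,000, i.e. just under $1M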
Callout: BigQuery rewards storage layout that helps it skip work
Mermaid: Cost model at a glance
flowchart TD
A[Query] --> B[Columns read]
B --> C[Bytes scanned]
C --> D[Cost]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A trigger
class B process
class C process
class D action
Next, the context that made this optimization matter
Background & Challenge
What you’ll learn: Why BigQuery fit, what changed at scale, and the core constraint
Shopify needed a warehouse that could ingest at scale, answer queries fast, and fit its GCP stack. BigQuery fit thanks to columnar storage, a serverless design, and low operational overhead
- Columnar storage (data stored by column for fast analytics reads)
- On‑demand pricing (pay by bytes processed, good for spiky loads)
- GCP integration reduced friction
The challenge was not speed but cost predictability once traffic ramped
Callout: Columnar systems shine when queries read few columns and skip most data
Mermaid: Minimal events model
erDiagram
Events {
int id
int user_id
string campaign_id
string country
datetime event_time
int revenue
}
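The same minimal model as a hypothetical BigQuery DDL sketch; names and types mirror the diagram and are illustrative, not Shopify's actual schema:
-- Hypothetical DDL for the minimal events model (illustrative only)
CREATE TABLE analytics.events (
  id          INT64,
  user_id     INT64,
  campaign_id STRING,
  country     STRING,
  event_time  TIMESTAMP,
  revenue     INT64
);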
With the context set, let’s unpack how pricing and scans actually work
Pricing & Scans
What you’ll learn: How on‑demand pricing bills, what inflates scans, and quick guardrails
BigQuery on‑demand pricing bills by data processed, not CPU time. If a query touches columns that total 75 GB, you pay for 75 GB
- Bytes billed follow referenced columns and filters that prune data
- SELECT * reads all columns and explodes scans
- Partitioning (split data into ranges, typically by date) and clustering (sort rows by keys within each partition) enable storage‑level skipping
-- Costly pattern
SELECT *
FROM analytics.events
WHERE event_date BETWEEN '2025-11-01' AND '2025-11-07'
AND country = 'US';
-- Better: project needed columns only
SELECT event_date, user_id, campaign_id, revenue
FROM analytics.events
WHERE event_date BETWEEN '2025-11-01' AND '2025-11-07'
AND country = 'US';
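Before running anything heavy, a dry run reports how many bytes a query would process without billing them; the bq invocation below is a sketch with a shortened query:
-- Estimate the scan first: a dry run validates the query and reports bytes processed, billing nothing
-- bq query --use_legacy_sql=false --dry_run 'SELECT event_date, user_id FROM analytics.events WHERE country = "US"'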
Mermaid: Query pruning flow
flowchart TD
A[Query filters] --> B[Partition prune]
B --> C[Cluster prune]
C --> D[Fewer bytes]
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A process
class B process
class C process
class D action
Set guardrails while iterating, then raise them with confidence
-- Safety: cap how much a single query can bill (bq CLI example)
-- bq query --maximum_bytes_billed=1000000000 'SELECT ...'  -- ~1 GB cap
Callout: Use INFORMATION_SCHEMA to spot heavy jobs and educate teams on SELECT discipline
With billing mechanics clear, here’s the change that moved the needle
Clustering Fix
What you’ll learn: How clustering works, how to pick keys, and how to roll out safely
Clustering sorts rows by the columns that appear in your highest‑value WHERE filters. That lets BigQuery skip most blocks without scanning them
- Clustering sorts data by chosen keys to improve pruning
- Partitioning splits data into ranges for coarse pruning
- Selectivity means a filter keeps a small fraction of rows
Pick the right keys
Queries reveal how you search the haystack. Start there
- Collect 30 days of queries and extract frequent filters
- Rank columns by selectivity and frequency; prefer high‑cardinality fields (see the cardinality check below)
- Keep the set small, often 1–2 columns
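A minimal way to compare candidates, using the columns from this example, is an approximate cardinality check over the raw table:
-- Sketch: rank candidate clustering keys by approximate cardinality
SELECT
  COUNT(*)                           AS total_rows,
  APPROX_COUNT_DISTINCT(country)     AS distinct_countries,
  APPROX_COUNT_DISTINCT(campaign_id) AS distinct_campaigns,
  APPROX_COUNT_DISTINCT(user_id)     AS distinct_users
FROM analytics.events_raw;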
-- Example: cluster by country then campaign_id
CREATE TABLE analytics.events_clustered
PARTITION BY DATE(event_timestamp)
CLUSTER BY country, campaign_id AS
SELECT * FROM analytics.events_raw;
Implement step by step
You can move an existing table to a clustered layout without downtime or disrupting consumers
- Create a new table with PARTITION BY and CLUSTER BY
- Backfill with INSERT…SELECT or CTAS
- Swap consumers (for example via a view, sketched after the CTAS below) and monitor bytes billed
CREATE TABLE prod.events_v2
PARTITION BY DATE(event_timestamp)
CLUSTER BY country, campaign_id AS
SELECT event_timestamp, country, campaign_id, user_id, revenue, attrs
FROM prod.events_v1;
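One low‑risk way to do the swap, assuming consumers read a stable name rather than the physical table, is to repoint a view (the view name here is illustrative):
-- Sketch: repoint readers at the clustered table via a view
CREATE OR REPLACE VIEW prod.events AS
SELECT * FROM prod.events_v2;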
Mermaid: Rollout steps
flowchart TD
A[Create table] --> B[Backfill]
B --> C[Swap reads]
C --> D[Monitor cost]
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A process
class B process
class C action
class D action
Partition + cluster
Use both when filters combine date with attributes
| Technique | Best when | Caveat |
|---|---|---|
| Partitioning | Most queries filter by date | Coarse pruning only |
| Clustering | Queries filter on attributes | Pre‑exec byte estimates can vary |
| Both | Filters include date + attributes | Slightly more modeling upfront |
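A query shaped like the sketch below prunes on the date partition first and then on the country and campaign_id clustering blocks; names follow the earlier examples:
-- Prunes on the event_timestamp partition, then on the country/campaign_id clustering keys
SELECT campaign_id, SUM(revenue) AS total_revenue
FROM prod.events_v2
WHERE event_timestamp >= TIMESTAMP '2025-11-01'
  AND event_timestamp <  TIMESTAMP '2025-11-08'
  AND country = 'US'
GROUP BY campaign_id;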
Callout: Order clustering keys by most selective first and let auto reclustering do the rest
Now let’s see what changed in dollars and seconds
Results & Lessons
What you’ll learn: The savings, the latency gains, and a reusable playbook
Clustering transformed costs without hurting speed. The same query scanned megabytes instead of gigabytes
- Bytes/query: 75 GB → ~508 MB (≈150× reduction)
- Monthly cost: just under $1,000,000 → ≈ $1,400
- Latency: fewer bytes read, faster first row and completion
Cost math check
Monthly queries ≈ 2,592,000 (60 rpm)
Before: ~75 GB/query → just under $1M/month
After: ~0.508 GB/query → ≈ $1.4K/month
Notes: On‑demand bills by data processed; TiB vs TB rounding and free tiers apply
Performance gains
- Faster dashboards and APIs due to block pruning
- Lower tail latencies under load because scans stop earlier
- More headroom for concurrent workloads
Tactics you can reuse
Write cheaper queries
- Avoid SELECT * and project only needed columns
- Push filters into WHERE to prune earlier
- Use approximate functions or pre‑aggregations for repeated metrics (see the sketch below)
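As a sketch of the approximate‑aggregation tactic on the example events table, an approximate distinct count is far cheaper than an exact one for a repeated dashboard metric:
-- Sketch: approximate distincts for a repeated dashboard metric
SELECT
  event_date,
  APPROX_COUNT_DISTINCT(user_id) AS approx_active_users,
  SUM(revenue)                   AS revenue
FROM analytics.events
WHERE event_date BETWEEN '2025-11-01' AND '2025-11-07'
GROUP BY event_date;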
Add monitoring and guardrails
- Cap spend per query with maximum bytes billed during development
- Track heavy hitters via INFORMATION_SCHEMA.JOBS
- Review partition and cluster recommendations monthly
-- Find the top 50 most expensive jobs last 7 days
SELECT user_email, total_bytes_processed, query, creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND statement_type = 'SELECT'
ORDER BY total_bytes_processed DESC
LIMIT 50;
Design for scale
- Partition by date, then cluster by the most selective attributes
- Materialize common aggregates and refresh them incrementally (sketched below)
- Expire staging tables by default; untouched raw data then ages into cheaper long‑term storage
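A sketch of both ideas, assuming the clustered table from earlier; the materialized view refreshes incrementally on its own, and the dataset option makes staging tables expire by default (names are illustrative):
-- Sketch: incrementally refreshed rollup for a common aggregate
CREATE MATERIALIZED VIEW analytics.daily_country_revenue AS
SELECT DATE(event_timestamp) AS event_date, country, SUM(revenue) AS revenue
FROM prod.events_v2
GROUP BY event_date, country;

-- Sketch: staging tables expire automatically after 3 days unless overridden
ALTER SCHEMA staging SET OPTIONS (default_table_expiration_days = 3);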
Callout: The playbook: model for pruning, cap bytes while iterating, monitor heavy hitters, and revisit layout as workloads evolve. The cheapest byte is the one you never scan