
How Shopify Cut a Near $1M BigQuery Bill to ~$1.4K


Shopify BigQuery Savings

What you’ll learn: How bytes scanned drive cost, the before/after impact, and the core idea behind the fix

A single unoptimized query can burn cash. Shopify saw each run scan ~75 GB, which at scale projected to nearly $1M per month.

  • 60 requests/min × 60 × 24 × 30 ≈ 2.59M queries/month
  • 75 GB scanned/query on on‑demand pricing turned into massive spend
  • Clustering dropped each scan to ~508 MB and the bill to about $1.4K/month
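
Sanity-checking the headline numbers (a sketch assuming the long-standing $5/TB on-demand rate; current pricing is per TiB and varies by region):

60 req/min × 60 × 24 × 30 ≈ 2,592,000 queries/month
2,592,000 × 75 GB ≈ 194,400 TB scanned/month
194,400 TB × $5/TB ≈ $972,000 → "just under $1M"
2,592,000 × ~508 MB ≈ 1,317 TB/month, roughly 148× fewer bytes

The exact after-dollar figure also depends on rounding, caching, and free tiers, as the cost math later notes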

In short, bytes scanned are the budget; prune the bytes, save the budget

You pay for the data your queries touch, not for the queries themselves

Before (approx)
- Bytes/query: 75 GB
- Monthly queries: 2,592,000
- On‑demand price: billed per TB of data processed
- Projected monthly cost: just under $1,000,000

After (approx)
- Bytes/query: ~508 MB
- Monthly queries: same volume (2,592,000)
- Projected monthly cost: ≈ $1,400

Callout: BigQuery rewards storage layout that helps it skip work

Mermaid: Cost model at a glance

flowchart TD
    A[Query] --> B[Columns read]
    B --> C[Bytes scanned]
    C --> D[Cost]

    classDef trigger fill:#e1f5fe,stroke:#01579b
    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32
    class A trigger
    class B process
    class C process
    class D action

Next, the context that made this optimization matter


Background & Challenge

What you’ll learn: Why BigQuery fit, what changed at scale, and the core constraint

Shopify needed a warehouse that could ingest at scale, answer queries fast, and fit its GCP stack. BigQuery fit thanks to columnar storage, serverless design, and low operational overhead

  • Columnar storage (data stored by column for fast analytics reads)
  • On‑demand pricing (pay by bytes processed, good for spiky loads)
  • GCP integration reduced friction

The challenge was not speed but cost predictability once traffic ramped

Callout: Columnar systems shine when queries read few columns and skip most data

Mermaid: Minimal events model

erDiagram
    Events {
        int id
        int user_id
        string campaign_id
        string country
        datetime event_timestamp
        int revenue
    }
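
A minimal sketch of this model as BigQuery DDL (column names and types are assumptions read off the diagram; later sections reference variants of this table such as analytics.events_raw and prod.events_v1):

-- Hypothetical events table backing the examples below
CREATE TABLE analytics.events (
  id INT64,
  user_id INT64,
  campaign_id STRING,
  country STRING,
  event_timestamp TIMESTAMP,
  revenue INT64
);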

With the context set, let’s unpack how pricing and scans actually work


Pricing & Scans

What you’ll learn: How on‑demand pricing bills, what inflates scans, and quick guardrails

BigQuery on‑demand pricing bills by data processed, not CPU time. If a query touches columns that total 75 GB, you pay for 75 GB

  • Bytes billed follow referenced columns and filters that prune data
  • SELECT * reads all columns and explodes scans
  • Partitioning (split data into ranges) and clustering (sort by keys within parts) enable storage‑level skipping

-- Costly pattern
SELECT *
FROM analytics.events
WHERE DATE(event_timestamp) BETWEEN '2025-11-01' AND '2025-11-07'
  AND country = 'US';

-- Better: project needed columns only
SELECT DATE(event_timestamp) AS event_date, user_id, campaign_id, revenue
FROM analytics.events
WHERE DATE(event_timestamp) BETWEEN '2025-11-01' AND '2025-11-07'
  AND country = 'US';

Mermaid: Query pruning flow

flowchart TD
    A[Query filters] --> B[Partition prune]
    B --> C[Cluster prune]
    C --> D[Fewer bytes]

    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32
    class A process
    class B process
    class C process
    class D action

Set guardrails while iterating, then raise them with confidence

-- Safety: cap how much a query can bill; the job fails fast if it would exceed the cap
-- bq query --use_legacy_sql=false --maximum_bytes_billed=1000000000 'SELECT ...'  -- ~1 GB
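
To see the bill before paying it, the bq CLI supports a dry run that reports bytes to be processed without executing anything; a sketch in the same comment style:

-- Estimate first: a dry run returns bytes scanned and bills nothing
-- bq query --use_legacy_sql=false --dry_run 'SELECT country FROM analytics.events'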

Callout: Use INFORMATION_SCHEMA to spot heavy jobs and educate teams on SELECT discipline

With billing mechanics clear, here’s the change that moved the needle


Clustering Fix

What you’ll learn: How clustering works, how to pick keys, and how to roll out safely

Clustering physically sorts rows by chosen columns, typically the high‑value filters in your WHERE clauses. It lets BigQuery skip most blocks without reading them

  • Clustering sorts data by chosen keys to improve pruning
  • Partitioning splits data into ranges for coarse pruning
  • Selectivity means a filter keeps a small fraction of rows

Pick the right keys

Queries reveal how you search the haystack. Start there

  1. Collect 30 days of queries and extract frequent filters
  2. Rank columns by selectivity and frequency, prefer high‑cardinality fields (a mining sketch follows the example below)
  3. Keep the set small, often 1–2 columns

-- Example: cluster by country then campaign_id
CREATE TABLE analytics.events_clustered
PARTITION BY DATE(event_timestamp)
CLUSTER BY country, campaign_id AS
SELECT * FROM analytics.events_raw;
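
One way to work through steps 1–2, sketched against INFORMATION_SCHEMA (the regexes and candidate columns are assumptions to adapt to your workload):

-- Steps 1–2: how often do candidate columns appear as filters?
SELECT
  COUNTIF(REGEXP_CONTAINS(LOWER(query), r'country\s*=')) AS country_filters,
  COUNTIF(REGEXP_CONTAINS(LOWER(query), r'campaign_id\s*=')) AS campaign_filters,
  COUNT(*) AS total_selects
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND statement_type = 'SELECT';

-- Selectivity check: cardinality, and the row fraction a typical filter keeps
SELECT
  APPROX_COUNT_DISTINCT(campaign_id) AS campaign_cardinality,
  COUNTIF(country = 'US') / COUNT(*) AS us_row_fraction
FROM analytics.events_raw;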

Implement step by step

You can cluster existing tables without a full rewrite

  1. Create a new table with PARTITION BY and CLUSTER BY
  2. Backfill with INSERT…SELECT or CTAS
  3. Swap consumers and monitor bytes billed (a view swap sketch follows the backfill below)

CREATE TABLE prod.events_v2
PARTITION BY DATE(event_timestamp)
CLUSTER BY country, campaign_id AS
SELECT event_timestamp, country, campaign_id, user_id, revenue, attrs
FROM prod.events_v1;
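
For step 3, a view swap keeps the cutover atomic for readers (a sketch assuming consumers query prod.events rather than the versioned tables):

-- Repoint consumers in one step once the backfill checks out
CREATE OR REPLACE VIEW prod.events AS
SELECT * FROM prod.events_v2;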

Mermaid: Rollout steps

flowchart TD
    A[Create table] --> B[Backfill]
    B --> C[Swap reads]
    C --> D[Monitor cost]

    classDef process fill:#fff3e0,stroke:#ef6c00
    classDef action fill:#e8f5e8,stroke:#2e7d32
    class A process
    class B process
    class C action
    class D action

Partition + cluster

Use both when filters combine date with attributes

Technique    | Best when                          | Caveat
------------ | ---------------------------------- | ---------------------------------
Partitioning | Most queries filter by date        | Coarse pruning only
Clustering   | Queries filter on attributes       | Pre‑exec byte estimates can vary
Both         | Filters include date + attributes  | Slightly more modeling upfront

Callout: Order clustering keys by most selective first and let auto reclustering do the rest

Now let’s see what changed in dollars and seconds


Results & Lessons

What you’ll learn: The savings, the latency gains, and a reusable playbook

Clustering transformed costs without hurting speed. The same query scanned megabytes instead of gigabytes

  • Bytes/query: 75 GB → ~508 MB (≈150× reduction)
  • Monthly cost: just under $1,000,000 → ≈ $1,400
  • Latency: fewer bytes read, faster first row and completion

Cost math check

Monthly queries ≈ 2,592,000 (60 req/min × 60 × 24 × 30)
Before ≈ 75 GB/query    → just under $1M/month
After  ≈ 0.508 GB/query → ≈ $1.4K/month
Notes: on‑demand bills by data processed; TiB vs TB rounding and free tiers affect the exact figures

Performance gains

  • Faster dashboards and APIs due to block pruning
  • Lower tail latencies under load because scans stop earlier
  • More headroom for concurrent workloads

Tactics you can reuse

Write cheaper queries

  1. Avoid SELECT * and project only needed columns
  2. Push filters into WHERE to prune earlier
  3. Use approximate aggregates or pre‑aggregations for repeated metrics (sketch below)
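
For step 3, approximate aggregates trade a little precision for far less work; a sketch over the events table from earlier:

-- Approximate distinct counts are much cheaper than exact COUNT(DISTINCT ...)
SELECT campaign_id, APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM analytics.events
WHERE DATE(event_timestamp) BETWEEN '2025-11-01' AND '2025-11-07'
GROUP BY campaign_id;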

Add monitoring and guardrails

  1. Cap spend per query with maximum bytes billed during development
  2. Track heavy hitters via INFORMATION_SCHEMA.JOBS
  3. Review partition and cluster recommendations monthly

-- Find the top 50 most expensive jobs last 7 days
SELECT user_email, total_bytes_processed, query, creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND statement_type = 'SELECT'
ORDER BY total_bytes_processed DESC
LIMIT 50;

Design for scale

  1. Partition by date, then cluster by the most selective attributes
  2. Materialize common aggregates and refresh incrementally
  3. Expire staging tables by default to keep raw cold data cheap (sketches for steps 2–3 below)
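
Sketches for steps 2 and 3 (the dataset and view names are assumptions):

-- Step 2: a materialized view that BigQuery refreshes incrementally
CREATE MATERIALIZED VIEW analytics.daily_revenue AS
SELECT DATE(event_timestamp) AS day, country, SUM(revenue) AS total_revenue
FROM analytics.events_clustered
GROUP BY day, country;

-- Step 3: make staging tables self‑expiring (here, after 7 days)
ALTER SCHEMA staging SET OPTIONS (default_table_expiration_days = 7);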

Callout: Playbook: model for pruning, cap bytes while iterating, monitor heavy hitters, and revisit layout as workloads evolve. The cheapest byte is the one you never scan
