Shopify BigQuery Savings
What you’ll learn: How bytes scanned drive cost, the before/after impact, and the core idea behind the fix
A single unoptimized query can burn cash. Shopify saw each run scan ~75 GB, which at scale projected to nearly $1M per month.
- 60 requests/min × 60 × 24 × 30 ≈ 2.59M queries/month
- 75 GB scanned/query on on‑demand pricing turned into massive spend
- Clustering dropped each scan to ~508 MB and the bill to about $1.4K/month
In short, bytes scanned are the budget; prune the bytes, save the budget
You pay for the data queries touch, not for queries themselves
Before (approx)
- Bytes/query: 75 GB
- Monthly queries: 2,592,000
- On‑demand rate: billed per TB of data processed
- Projected monthly cost: just under $1,000,000
After (approx)
- Bytes/query: ~508 MB
- Monthly queries: same volume (~2.59M)
- Projected monthly cost: ≈ $1,400
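A quick sanity check on the before figure, assuming the legacy ~$5 per TB on‑demand rate in effect at the time (the exact rate and free tier shift the number slightly):
-- Back-of-envelope check of the "before" spend (the $5/TB rate is an assumption, not the exact bill)
SELECT
  60 * 60 * 24 * 30       AS queries_per_month,     -- 2,592,000
  2592000 * 75 / 1000     AS tb_scanned_per_month,  -- 194,400 TB
  2592000 * 75 / 1000 * 5 AS approx_usd_per_month;  -- ≈ 972,000, i.e. just under $1M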
Callout: BigQuery rewards storage layout that helps it skip work
Mermaid: Cost model at a glance
flowchart TD
A[Query] --> B[Columns read]
B --> C[Bytes scanned]
C --> D[Cost]
classDef trigger fill:#e1f5fe,stroke:#01579b
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A trigger
class B process
class C process
class D action
Next, the context that made this optimization matter
Background & Challenge
What you’ll learn: Why BigQuery fit, what changed at scale, and the core constraint
Shopify needed a warehouse that could ingest at scale, answer queries fast, and fit its GCP stack. BigQuery fit thanks to columnar storage, a serverless design, and low operational overhead
- Columnar storage (data stored by column for fast analytics reads)
- On‑demand pricing (pay by bytes processed, good for spiky loads)
- GCP integration reduced friction
The challenge was not speed but cost predictability once traffic ramped
Callout: Columnar systems shine when queries read few columns and skip most data
Mermaid: Minimal events model
erDiagram
Events {
int id
int user_id
string campaign_id
string country
datetime event_time
int revenue
}
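The same minimal model as a hypothetical BigQuery DDL sketch; names and types mirror the diagram and are illustrative, not Shopify's actual schema:
-- Hypothetical DDL for the minimal events model (illustrative only)
CREATE TABLE analytics.events (
  id          INT64,
  user_id     INT64,
  campaign_id STRING,
  country     STRING,
  event_time  TIMESTAMP,
  revenue     INT64
);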
With the context set, let’s unpack how pricing and scans actually work
Pricing & Scans
What you’ll learn: How on‑demand pricing bills, what inflates scans, and quick guardrails
BigQuery on‑demand pricing bills by data processed, not CPU time. If a query touches columns that total 75 GB, you pay for 75 GB
- Bytes billed follow referenced columns and filters that prune data
- SELECT * reads all columns and explodes scans
- Partitioning (split data into ranges, typically by date) and clustering (sort rows by keys within each partition) enable storage‑level skipping
-- Costly pattern
SELECT *
FROM analytics.events
WHERE event_date BETWEEN '2025-11-01' AND '2025-11-07'
AND country = 'US';
-- Better: project needed columns only
SELECT event_date, user_id, campaign_id, revenue
FROM analytics.events
WHERE event_date BETWEEN '2025-11-01' AND '2025-11-07'
AND country = 'US';
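Before running anything heavy, a dry run reports how many bytes a query would process without billing them; the bq invocation below is a sketch with a shortened query:
-- Estimate the scan first: a dry run validates the query and reports bytes processed, billing nothing
-- bq query --use_legacy_sql=false --dry_run 'SELECT event_date, user_id FROM analytics.events WHERE country = "US"'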
Mermaid: Query pruning flow
flowchart TD
A[Query filters] --> B[Partition prune]
B --> C[Cluster prune]
C --> D[Fewer bytes]
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A process
class B process
class C process
class D action
Set guardrails while iterating, then raise them with confidence
-- Safety: cap how much a single query can bill (bq CLI example)
-- bq query --maximum_bytes_billed=1000000000 'SELECT ...'  -- ~1 GB cap
Callout: Use INFORMATION_SCHEMA to spot heavy jobs and educate teams on SELECT discipline
With billing mechanics clear, here’s the change that moved the needle
Clustering Fix
What you’ll learn: How clustering works, how to pick keys, and how to roll out safely
Clustering sorts rows by the columns that appear in your highest‑value WHERE filters. That lets BigQuery skip most blocks without scanning them
- Clustering sorts data by chosen keys to improve pruning
- Partitioning splits data into ranges for coarse pruning
- Selectivity means a filter keeps a small fraction of rows
Pick the right keys
Queries reveal how you search the haystack. Start there
- Collect 30 days of queries and extract frequent filters
- Rank columns by selectivity and frequency; prefer high‑cardinality fields (see the cardinality check below)
- Keep the set small, often 1–2 columns
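A minimal way to compare candidates, using the columns from this example, is an approximate cardinality check over the raw table:
-- Sketch: rank candidate clustering keys by approximate cardinality
SELECT
  COUNT(*)                           AS total_rows,
  APPROX_COUNT_DISTINCT(country)     AS distinct_countries,
  APPROX_COUNT_DISTINCT(campaign_id) AS distinct_campaigns,
  APPROX_COUNT_DISTINCT(user_id)     AS distinct_users
FROM analytics.events_raw;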
-- Example: cluster by country then campaign_id
CREATE TABLE analytics.events_clustered
PARTITION BY DATE(event_timestamp)
CLUSTER BY country, campaign_id AS
SELECT * FROM analytics.events_raw;
Implement step by step
You can move an existing table to a clustered layout without downtime or disrupting consumers
- Create a new table with PARTITION BY and CLUSTER BY
- Backfill with INSERT…SELECT or CTAS
- Swap consumers (for example via a view, sketched after the CTAS below) and monitor bytes billed
CREATE TABLE prod.events_v2
PARTITION BY DATE(event_timestamp)
CLUSTER BY country, campaign_id AS
SELECT event_timestamp, country, campaign_id, user_id, revenue, attrs
FROM prod.events_v1;
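One low‑risk way to do the swap, assuming consumers read a stable name rather than the physical table, is to repoint a view (the view name here is illustrative):
-- Sketch: repoint readers at the clustered table via a view
CREATE OR REPLACE VIEW prod.events AS
SELECT * FROM prod.events_v2;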
Mermaid: Rollout steps
flowchart TD
A[Create table] --> B[Backfill]
B --> C[Swap reads]
C --> D[Monitor cost]
classDef process fill:#fff3e0,stroke:#ef6c00
classDef action fill:#e8f5e8,stroke:#2e7d32
class A process
class B process
class C action
class D action
Partition + cluster
Use both when filters combine date with attributes
| Technique | Best when | Caveat |
|---|---|---|
| Partitioning | Most queries filter by date | Coarse pruning only |
| Clustering | Queries filter on attributes | Pre‑exec byte estimates can vary |
| Both | Filters include date + attributes | Slightly more modeling upfront |
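A query shaped like the sketch below prunes on the date partition first and then on the country and campaign_id clustering blocks; names follow the earlier examples:
-- Prunes on the event_timestamp partition, then on the country/campaign_id clustering keys
SELECT campaign_id, SUM(revenue) AS total_revenue
FROM prod.events_v2
WHERE event_timestamp >= TIMESTAMP '2025-11-01'
  AND event_timestamp <  TIMESTAMP '2025-11-08'
  AND country = 'US'
GROUP BY campaign_id;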
Callout: Order clustering keys by most selective first and let auto reclustering do the rest
Now let’s see what changed in dollars and seconds
Results & Lessons
What you’ll learn: The savings, the latency gains, and a reusable playbook
Clustering transformed costs without hurting speed. The same query scanned megabytes instead of gigabytes
- Bytes/query: 75 GB → ~508 MB (≈150× reduction)
- Monthly cost: just under $1,000,000 → ≈ $1,400
- Latency: fewer bytes read, faster first row and completion
Cost math check
Monthly queries ≈ 2,592,000 (60 rpm)
Before: ~75 GB/query → just under $1M/month
After: ~0.508 GB/query → ≈ $1.4K/month
Notes: On‑demand bills by data processed; TiB vs TB rounding and free tiers apply
Performance gains
- Faster dashboards and APIs due to block pruning
- Lower tail latencies under load because scans stop earlier
- More headroom for concurrent workloads
Tactics you can reuse
Write cheaper queries
- Avoid SELECT * and project only needed columns
- Push filters into WHERE to prune earlier
- Use approximate functions or pre‑aggregations for repeated metrics (see the sketch below)
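As a sketch of the approximate‑aggregation tactic on the example events table, an approximate distinct count is far cheaper than an exact one for a repeated dashboard metric:
-- Sketch: approximate distincts for a repeated dashboard metric
SELECT
  event_date,
  APPROX_COUNT_DISTINCT(user_id) AS approx_active_users,
  SUM(revenue)                   AS revenue
FROM analytics.events
WHERE event_date BETWEEN '2025-11-01' AND '2025-11-07'
GROUP BY event_date;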
Add monitoring and guardrails
- Cap spend per query with maximum bytes billed during development
- Track heavy hitters via INFORMATION_SCHEMA.JOBS
- Review partition and cluster recommendations monthly
-- Find the top 50 most expensive jobs last 7 days
SELECT user_email, total_bytes_processed, query, creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND statement_type = 'SELECT'
ORDER BY total_bytes_processed DESC
LIMIT 50;
Design for scale
- Partition by date, then cluster by the most selective attributes
- Materialize common aggregates and refresh them incrementally (sketched below)
- Expire staging tables by default; untouched raw data then ages into cheaper long‑term storage
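A sketch of both ideas, assuming the clustered table from earlier; the materialized view refreshes incrementally on its own, and the dataset option makes staging tables expire by default (names are illustrative):
-- Sketch: incrementally refreshed rollup for a common aggregate
CREATE MATERIALIZED VIEW analytics.daily_country_revenue AS
SELECT DATE(event_timestamp) AS event_date, country, SUM(revenue) AS revenue
FROM prod.events_v2
GROUP BY event_date, country;

-- Sketch: staging tables expire automatically after 3 days unless overridden
ALTER SCHEMA staging SET OPTIONS (default_table_expiration_days = 3);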
Callout: The playbook: model for pruning, cap bytes while iterating, monitor heavy hitters, and revisit layout as workloads evolve. The cheapest byte is the one you never scan