On a quiet assumption that would have cost us more than money.

When you've worked with cloud storage for a while, list_blobs() feels instinctive. You want to know what's in a bucket, you list it. The API is right there. The SDK call is one line.

We nearly did exactly that for a migration involving 30+ million objects. This post is about why we didn't, and what we used instead.

The Problem with Listing at Scale

Object storage list APIs are not designed for iterating over tens of millions of objects in a tight loop. They page - typically 1,000 results per request - and they charge you for each page. On Google Cloud Storage, list operations are Class A: the most expensive tier.

The maths for 30 million objects looks like this:

30,000,000 objects / 1,000 per page = 30,000 list requests.
30,000 requests × $0.005 / 1,000 = $0.15 per listing.

$0.15 sounds trivial. But migrations don't run once. They crash, restart, get paused, resumed. Add metadata reads to determine which objects to process. Add the fact that the bucket is live and objects are added while you're listing, so you might list multiple times for consistency.
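To see how repeated listings compound, the arithmetic above can be written as a small function. This is only a sketch: the page size and price are the figures quoted in this post, and the `passes` parameter (how many full listings you end up doing) is a hypothetical knob, not a measured number.

```python
import math

# Figures quoted in this post; check current pricing for your storage class.
PAGE_SIZE = 1_000                    # results returned per list request
PRICE_PER_1000_CLASS_A_USD = 0.005   # assumed Class A price per 1,000 operations


def listing_requests(object_count: int, passes: int = 1) -> int:
    """Number of list requests needed to page through the bucket `passes` times."""
    return math.ceil(object_count / PAGE_SIZE) * passes


def listing_cost_usd(object_count: int, passes: int = 1) -> float:
    """Cost of the list requests alone; metadata reads are not included."""
    return listing_requests(object_count, passes) * PRICE_PER_1000_CLASS_A_USD / 1_000
```

With these figures, `listing_cost_usd(30_000_000)` comes to roughly $0.15, and every restart that re-lists the bucket adds the same amount again, before counting any per-object metadata reads.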

The deeper problem: a live list is a moving target. If you list a 30-million-object bucket over the course of an hour, you're not guaranteed a stable snapshot. Objects written during your listing may appear. Objects deleted may or may not appear. For a migration where correctness matters, building logic on a mutable foundation is risky.

We needed a stable, complete list of objects. We needed it to be cheap to obtain. And we needed to be able to re-read it without hitting the API again.

The Discovery Problem: Finding a Stable Foundation

Stepping back from the implementation, what we actually needed was:
1. A complete, point-in-time list of all object names in the bucket
2. Sortable (we needed newest-first processing order)
3. Deduplicated (some objects appeared in multiple report windows)
4. Cheap to obtain repeatedly - if the migration restarted, we shouldn't re-pay for discovery

When you look at those four requirements, you realise they aren't describing an API response; they are describing a file. Specifically, a sorted CSV file containing one object name per line.
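A minimal sketch of turning raw entries into that kind of file content, covering the deduplication and newest-first requirements. The `Entry` type and `build_manifest` function are hypothetical names for illustration. It assumes timestamps are ISO 8601 strings in one uniform format, so that string comparison matches time order, and that the data fits in memory.

```python
from typing import Iterable, NamedTuple


class Entry(NamedTuple):
    name: str
    updated: str  # assumed ISO 8601 UTC, e.g. "2026-01-14T09:30:00Z"


def build_manifest(entries: Iterable[Entry]) -> list[str]:
    """Return unique object names, newest first."""
    latest: dict[str, str] = {}
    for entry in entries:
        # Deduplicate: keep only the most recent timestamp seen per object name.
        if entry.name not in latest or entry.updated > latest[entry.name]:
            latest[entry.name] = entry.updated

    # Sort by name first, then by timestamp descending. Python's sort is
    # stable, so equal timestamps stay in name order and the output is
    # deterministic across runs.
    ordered = sorted(latest.items(), key=lambda item: item[0])
    ordered.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in ordered]
```

Writing the returned names one per line gives a file that can be re-read on every restart without touching the list API.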

The question was: who generates it?

The Solution: Leveraging Inventory Reports

GCS (and equivalents on other cloud providers) offers bucket inventory reports: a scheduled export of all object metadata to CSV or Parquet files, deposited wherever you choose, either a prefix in the same bucket or a separate bucket entirely. The export is handled by the storage provider's infrastructure. You pay for storing the report files, which is cheap, but you don't pay for the listing.

We configured daily reports. The schema included the object name, and the reports were date-partitioned:

inventory-reports/
└── date=2026-01-14/
    ├── objects_000.csv
    └── objects_001.csv

Each CSV row provided the bucket name, storage class, object name, size, and last-modified date. It was everything we needed, waiting for us in a flat file.
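Reading such a file needs nothing beyond the standard library. The sketch below assumes a header row with columns named `bucket`, `storageClass`, `name`, `size` and `updated`. Those names, and the sample rows, are illustrative assumptions; the real headers depend on the metadata fields selected when the report was configured.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class InventoryRow(NamedTuple):
    bucket: str
    storage_class: str
    name: str
    size: int
    updated: str


def read_inventory(handle: TextIO) -> Iterator[InventoryRow]:
    """Yield one typed row per object from an inventory CSV with a header row."""
    for row in csv.DictReader(handle):
        yield InventoryRow(
            bucket=row["bucket"],                # assumed column name
            storage_class=row["storageClass"],   # assumed column name
            name=row["name"],                    # assumed column name
            size=int(row["size"]),               # assumed column name
            updated=row["updated"],              # assumed column name
        )


# Hypothetical sample data, standing in for one report file.
SAMPLE = (
    "bucket,storageClass,name,size,updated\n"
    "my-bucket,STANDARD,docs/a.json,120,2026-01-13T08:00:00Z\n"
    "my-bucket,NEARLINE,docs/b.json,2048,2026-01-12T17:30:00Z\n"
)

rows = list(read_inventory(io.StringIO(SAMPLE)))
```

Because the function takes any text handle, the same code works on a local file opened with `open(path, newline="")` or on report contents downloaded into memory.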

These inventory report files became the canonical input for the migration.

Why This Composability Matters

The pre-processing step and the migration step are now entirely decoupled. They have different failure modes, different runtime characteristics, and different restart behaviours.

Pre-processing is a one-time offline job. It runs to completion, writes its output, and is done. If it fails halfway, a checkpoint tracks which batch files were written. Restart picks up from there.

Migration is a long-running, stateful process. It has its own checkpointing, its own idempotency checks, its own dead-letter handling. It assumes its input is stable - and it is, because it's reading from files, not from a live API.

Neither step needs to know about the other's internal state. The batch files are the contract between them.
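One way a checkpointed pre-processing step could look, as a sketch rather than the job we actually ran. The file names (`batch_00000.txt`, `checkpoint.json`) and the `write_batches` function are hypothetical, and it assumes the sorted name list is already in memory.

```python
import json
from pathlib import Path
from typing import Sequence


def write_batches(names: Sequence[str], out_dir: Path, batch_size: int) -> list[Path]:
    """Write names into fixed-size batch files, skipping batches already recorded."""
    out_dir.mkdir(parents=True, exist_ok=True)
    checkpoint = out_dir / "checkpoint.json"
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()

    written: list[Path] = []
    for index, start in enumerate(range(0, len(names), batch_size)):
        if index in done:
            continue  # written by an earlier run
        path = out_dir / f"batch_{index:05d}.txt"
        # Write to a temporary file, then rename, so a crash never leaves a
        # half-written batch under its final name.
        tmp = path.with_suffix(".tmp")
        tmp.write_text("".join(f"{name}\n" for name in names[start:start + batch_size]))
        tmp.replace(path)
        done.add(index)
        checkpoint.write_text(json.dumps(sorted(done)))
        written.append(path)
    return written
```

Calling it a second time with the same input writes nothing new, which is the restart behaviour described above.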

This is a pattern worth generalising: for any large-scale batch operation on object storage, consider separating discovery (what exists?) from processing (what do we do with it?). Inventory reports make discovery cheap, stable, and repeatable. That frees processing to focus on correctness.
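The processing side of that separation can be sketched in the same spirit. The `run_migration` function, the `migrated_batches.log` file and the `batch_*.txt` naming are all hypothetical; the only real requirement is that `process` is idempotent, because a crash in the middle of a batch means that batch is replayed from its first line.

```python
from pathlib import Path
from typing import Callable


def run_migration(batch_dir: Path, process: Callable[[str], None]) -> int:
    """Process every object name in every unfinished batch file; return the count."""
    done_log = batch_dir / "migrated_batches.log"
    done = set(done_log.read_text().splitlines()) if done_log.exists() else set()

    processed = 0
    # Sorted file names give the same batch order on every run.
    for batch in sorted(batch_dir.glob("batch_*.txt")):
        if batch.name in done:
            continue  # finished in an earlier run
        for name in batch.read_text().splitlines():
            process(name)  # must be idempotent: an interrupted batch is replayed
            processed += 1
        # Record completion only after the whole batch has been processed.
        with done_log.open("a") as log:
            log.write(batch.name + "\n")
    return processed
```

Note that this reads only local files: the migration never asks the bucket what exists, so concurrent writes to the bucket cannot change its input.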

What We Got

  • No live list API calls during migration: The bucket could be actively receiving writes. The migration didn't care - it was reading from a snapshot.
  • Consistent input: Every restart of the migration processed the same set of IDs from the same sorted batch files.
  • Built-in sort order: Descending time order came for free from the sort step. This encoded a useful processing invariant (newest version first) without any additional infrastructure.
  • Idempotency: Re-running the pre-processing step produced the same output without extra costs.

The Setup Cost

Configuring inventory reports on GCS takes about five minutes. Select your bucket in the console, go to the inventory reports tab, create a report, choose CSV format, set the destination prefix, and set the frequency to daily. The first report arrives within 24 hours.

After that, it runs automatically. The CSVs are there whenever you need them - for migrations, for audits, for capacity planning, for anything that needs a complete view of your bucket without paying to generate one.

If you're running at any scale where a list operation costs meaningful time or money, inventory reports should be your default starting point. The data is already being generated. You just need to read it.