
Remote Cache CDC: Reusing Bytes

Tyler French, Engineer @ BuildBuddy

New content being added to reused content

The goal: move the changed bytes, not the whole output.

BuildBuddy's Remote Cache uses Content-Defined Chunking (CDC) to make large build outputs behave more incrementally. When a binary, bundle, package, or archive is mostly unchanged, BuildBuddy can reuse chunks it has already seen instead of re-uploading or re-downloading the entire file.

In our Bazel chunking implementation PR, we observed 40% less data uploaded and a 40% smaller disk cache when benchmarked on BuildBuddy's own repo. To enable client-side CDC with BuildBuddy, use Bazel 8.7 or 9.1+ and pass --experimental_remote_cache_chunking.

Setting the Scene

The next frontier for build caching is not just skipping actions. It is skipping bytes.

Build caching has come a long way. Instead of rebuilding the world after every edit, Bazel and remote caching let teams reuse action outputs across machines and CI jobs. In practice, builds have moved from something closer to O(size of repo) toward O(size of change).

But "size of change" can be misleading. What really matters is the size of the transitive actions affected by the edit. A small source change can still ripple into many binaries, packages, bundles, and other large outputs, even when only a small part of each output actually changes.

That invalidation is expected. Build systems should rerun an action when its inputs change. The remote-cache problem is what happens next: the cache sees a new digest and moves the whole blob, even if that blob is mostly the same bytes as the previous version.

Transitive Actions

Linking, bundling, packaging, and archiving are where this shows up most often. They combine many transitive inputs into one output.

That makes them different from actions that operate on a small, direct set of files. A typical compile action might compile one source file using a smaller set of direct inputs. A transitive action, on the other hand, often consumes the accumulated outputs of many dependencies and produces one final binary, bundle, package, or archive.

In Bazel rules, this often shows up as a rule collecting files through a transitive depset and passing that accumulated set into a single action. For example, a simplified compile action might look like this:

ctx.actions.run(
    inputs = [src] + direct_headers,
    outputs = [obj],
    executable = compiler,
    arguments = ["-c", src.path, "-o", obj.path],
)

A bundling or packaging action often looks more like this:

transitive_inputs = depset(
    direct = direct_files,
    transitive = [dep[MyInfo].files for dep in ctx.attr.deps],
)

ctx.actions.run(
    inputs = transitive_inputs,
    outputs = [bundle],
    executable = bundler,
    arguments = ["--output", bundle.path],
)

That second shape is where small source changes can fan out into large output changes. The source edit might only change a small sequence of bytes in the final output, but the output digest is still new.

Without CDC, the cache treats that as a completely new blob, even when most of the binary, bundle, package, or archive is byte-for-byte identical to the previous version. If many final outputs depend on that changed input, they can all get new digests.

For remote caching, the expensive part is not just that the output is large. It is that the output is large and mostly similar to something the cache already has, but the whole-blob digest is new.

That creates two problems:

  • Uploads and downloads move the whole blob, even when only a small part changed.
  • Storage keeps another whole blob, even when most bytes are duplicates.

One workaround is to disable remote caching for these actions. That avoids uploading huge outputs when the expected cache hit is not worth the write cost, but it creates a different problem: the action now has to run every time. It can also make the action harder to move to remote execution, because RBE depends on moving action inputs and outputs efficiently.

So the build avoids one expensive cache write, but gives up reuse entirely.

Transitive action collapse

A small source change can invalidate the final transitive action.

Case study: Go tests

A common example is a shared go_library, say foo, that is imported by many other libraries: bar1, bar2, through barN. Each bar library may also have its own go_test.

An implementation-only change in foo might only rebuild foo's own GoCompilePkg action. The downstream compile actions can often still hit cache because Go compilation depends on direct dependency export data, like foo.x, not the full transitive archive graph.

Linking is different. Each go_test needs a test binary, produced by a GoLink action, and that link action consumes the transitive set of Go archives, like foo.a. If foo.a changes, many downstream test binaries can get new digests even when their source and compile actions did not change. Finally, the TestRunner action needs that test binary as an input in order to run it.

That means one small source edit can create many new test binary digests. Those test binaries are often large, and many of them are mostly the same bytes as before. Without CDC, each one is still transferred and stored as a new whole blob.

Treating This as an Output Problem

One option would be to make the actions themselves incremental: incremental linking, runtime linking, smarter bundling, smarter packaging, and so on. But this is usually very difficult, and requires extensive changes to the linkers and tools themselves.

And even if we solved that for one tool, we would still need separate solutions for GoLink, C++ linkers, JavaScript bundlers, app packagers, generated archives, and every other action that can produce a large output. That does not scale.

Instead, we can treat this as a generic output problem: these actions create large files, where only a small amount of content is changing. With Content-Defined Chunking (CDC), we can leave the actions themselves untouched, while still getting many of the wins of making those actions incremental.

Content-Defined Chunking

CDC is a repeatable process for splitting a file into chunks based on its contents rather than fixed byte offsets.

The TL;DR is: run a rolling hash over a small window of bytes, and split when the hash matches a rare pattern. The hash behaves randomly enough that this happens only occasionally, but the process is still deterministic: the same content produces the same chunk boundaries.

If you want chunks around 512 KiB on average, choose a split pattern that the hash matches with a probability of about 1 in 524,288 at any given byte, so that on average one byte in every 512 KiB becomes a cut point. If the pattern does not match, shift the window forward one byte and try again. Over time, this gives you the average chunk size you wanted while keeping the boundaries content-defined.

Smaller chunks improve deduplication but increase metadata overhead and RPC cost, so CDC implementations balance chunk size against efficiency.

For a toy example, imagine the rolling window is 4 bytes wide and we split whenever the hash of that 4-byte window ends in 00. Suppose the windows bbbb and cccc both happen to match that pattern (the exact hash values do not matter):

original: aaaabbbbccccdddd
windows:      bbbb
                  cccc
cuts:     aaaa|bbbb|cccc|dddd

If we insert a few bytes inside bbbb, the nearby windows change, so that chunk changes:

updated: aaaabbXXbbccccdddd

But once the rolling window moves past the inserted bytes and reaches cccc again, it sees the same 4-byte sequence as before. That sequence produces the same hash, so the algorithm finds the same cut point again. The later chunks can keep the same boundaries and hashes.

Real CDC uses a larger rolling window and a much rarer split pattern, but the idea is the same.

This means that a large file with a few bytes added or removed somewhere in the file usually only changes the nearby chunk(s). Once the rolling window moves past the changed bytes and reaches unchanged content again, it starts seeing the same byte sequences as before, so it finds the same future cut points.
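
To make the mechanics concrete, here is a toy chunker in Go. It is a sketch only: the gear-style hash, the 19-bit mask, and the minimum chunk size are illustrative choices for this example, not the actual parameters used by Bazel or BuildBuddy.

package cdcdemo

// chunkEnds returns the end offset of each chunk in data. Cut points depend
// only on the bytes near them, so identical content produces identical chunks
// no matter where it sits in the file.
func chunkEnds(data []byte) []int {
	const (
		// With a roughly uniform hash, requiring the low 19 bits to be zero
		// matches about once every 2^19 bytes, giving ~512 KiB average chunks.
		mask     = uint64(1)<<19 - 1
		minChunk = 64 * 1024 // skip cut points that would create tiny chunks
	)
	var ends []int
	var hash uint64
	start := 0
	for i, b := range data {
		// Gear-style rolling hash: the left shift gradually ages out old bytes,
		// so the hash effectively covers only a small window of recent content.
		hash = hash<<1 + uint64(b)
		if i-start+1 >= minChunk && hash&mask == 0 {
			ends = append(ends, i+1)
			start = i + 1
			hash = 0
		}
	}
	if start < len(data) {
		ends = append(ends, len(data))
	}
	return ends
}

Each chunk's bytes are then hashed (for example with SHA-256) to produce a chunk digest that the cache can store and look up like any other blob.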

One common CDC algorithm is FastCDC. The FastCDC presentation slides are also a helpful visual overview.

CDC chunk stability

Only the changed chunk needs to be uploaded again.

How does this benefit remote caching?

If an action creates a large output, like GoLink or CppLink, a small input change may still produce a new output that is mostly identical to the previous one.

With CDC, the cache can split that output into chunks and discover that many of them already exist. Instead of uploading the whole output again, it uploads only the missing chunks.

This works especially well for CI and developer builds, where nearby commits often produce outputs that are mostly similar. Once a chunk has been uploaded, future builds can reuse it across related outputs.

GoLink output stays mostly stable after an insertion

Most of the output can still map to chunks that already exist in the cache.

Results

Write deduplication ratio

In this recent window, CDC deduplicated about 85% of written bytes across eligible BuildBuddy cache writes. In other words, most large-output writes were already present as reusable chunks, so only the remaining changed chunks needed to be uploaded.

Write bytes saved per hour

Over this two-week window, CDC skipped uploading ~300 TiB of duplicate chunk data on the write path, with peaks over 4 TiB per hour. This comes from write-side chunk deduplication across BuildBuddy-managed cache writes and executor output uploads. Total network savings should be higher, since this does not include read-side savings when chunks are served from disk caches, regional caches, or executor file caches.

In production, CDC has already skipped hundreds of TiB of duplicate chunk uploads. Because BuildBuddy stores less duplicate data, effective cache retention has also improved.

The Bazel implementation PR benchmarked 50 commits of the BuildBuddy repo and saw about 40% less data uploaded, about 40% smaller disk cache, and faster builds in that benchmark.

BuildBuddy currently applies chunking to blobs larger than 2 MiB. In one test, only about 4.2% of objects were above that threshold, so most blobs are not chunked.

Within that eligible subset, CDC deduplicated about 85% of written bytes. Across all cache traffic, overall savings are typically in the 20 to 40% range.
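
For a rough sense of how those numbers relate: overall savings are roughly the chunk-eligible share of written bytes multiplied by the deduplication ratio within that subset. As a purely hypothetical illustration, if chunk-eligible blobs made up 35% of all written bytes, an 85% dedup ratio on those bytes would save about 0.35 × 0.85 ≈ 30% of total write traffic, in the middle of that range.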

As a rule of thumb, CDC works best for outputs that are large and byte-stable across revisions. Linking and packaging tend to be good fits, and most large outputs we see reuse most of their bytes. Bundling is also a good fit when the output is not compressed, obfuscated, or randomized.

Compression is not terrible, but it usually causes more churn. Compressed formats like tar.gz archives and Docker image layers are often less chunkable because a small input change can rewrite more of the compressed byte stream. The key property is byte-level similarity, not the file extension.

Implementation

To make this work end to end, the change lands in three places:

  • Remote APIs define the shared SplitBlob / SpliceBlob protocol so clients and caches can talk about chunks.
  • BuildBuddy implements the server-side cache behavior and executor-side chunked uploads and downloads.
  • Bazel implements the client-side combined cache path so the local disk cache and remote cache can share chunks.

Remote APIs: Split and Splice

To make CDC useful for remote caching, clients and servers need a way to talk about chunks instead of only whole blobs. This is especially useful when the network is the bottleneck: users on slow networks, VPNs, or with high latency to the cache should not need to upload or download a whole large output when most of its chunks already exist somewhere.

Instead, the client can discover how a blob maps to chunks, check which chunks are already available locally, and transfer only the missing pieces.

This is where SplitBlob and SpliceBlob come in.

SplitBlob is the read-side API. Given the digest of a large blob, the client asks the cache if it already knows the chunk layout for that blob. If it does, the client can download only the chunks it does not already have.

SpliceBlob is the write-side API. After an action creates a large output, Bazel or the executor uploads any missing chunks and tells the cache how to reconstruct the full blob from those chunks. The cache stores that reconstruction metadata so future SplitBlob calls for the same blob digest can return the chunk layout.

The read path becomes:

  1. Call SplitBlob to get the chunk layout for a large blob.
  2. Check which chunks are already present in the local cache.
  3. Download the missing chunks with Read or BatchReadBlobs.
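
As a rough Go sketch of that read path, with hypothetical simplified interfaces standing in for the real generated gRPC stubs and Bazel's local disk cache:

package cdcdemo

import (
	"bytes"
	"context"
)

// Digest identifies a blob or chunk by content hash and size (simplified form).
type Digest struct {
	Hash string
	Size int64
}

// chunkReader is a hypothetical view of the cache's read-side chunk APIs.
type chunkReader interface {
	SplitBlob(ctx context.Context, blob Digest) ([]Digest, error) // chunk layout for a large blob
	ReadChunk(ctx context.Context, chunk Digest) ([]byte, error)  // download a single chunk
}

// localStore is a hypothetical local disk cache keyed by chunk digest.
type localStore interface {
	Get(chunk Digest) ([]byte, bool)
	Put(chunk Digest, data []byte)
}

// fetchLargeBlob rebuilds a large blob by reusing chunks already present
// locally and downloading only the ones that are missing.
func fetchLargeBlob(ctx context.Context, remote chunkReader, local localStore, blob Digest) ([]byte, error) {
	chunks, err := remote.SplitBlob(ctx, blob) // 1. get the chunk layout
	if err != nil {
		return nil, err
	}
	var out bytes.Buffer
	for _, c := range chunks {
		data, ok := local.Get(c) // 2. reuse chunks we already have
		if !ok {
			data, err = remote.ReadChunk(ctx, c) // 3. download only the missing ones
			if err != nil {
				return nil, err
			}
			local.Put(c, data)
		}
		out.Write(data)
	}
	return out.Bytes(), nil
}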

The write path is the reverse:

  1. After producing a large output, the client or executor runs it through the CDC algorithm to compute chunk boundaries and chunk digests.
  2. It calls FindMissingBlobs to check which chunks the cache is missing.
  3. It uploads only the missing chunks with Write or BatchUpdateBlobs.
  4. It calls SpliceBlob to store the reconstruction metadata.
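
And a matching sketch of the write path, continuing the same sketch package. FindMissingChunks, WriteChunk, and SpliceBlob here are hypothetical helpers that mirror the RPC concepts rather than the real client API, and chunkEnds is the toy chunker from earlier.

package cdcdemo

import (
	"context"
	"crypto/sha256"
	"fmt"
)

// chunkWriter is a hypothetical view of the cache's write-side chunk APIs.
type chunkWriter interface {
	FindMissingChunks(ctx context.Context, chunks []Digest) ([]Digest, error)
	WriteChunk(ctx context.Context, chunk Digest, data []byte) error
	SpliceBlob(ctx context.Context, blob Digest, chunks []Digest) error
}

// chunkDigests runs CDC over the output and hashes each piece.
func chunkDigests(data []byte) ([]Digest, map[Digest][]byte) {
	var order []Digest
	pieces := map[Digest][]byte{}
	start := 0
	for _, end := range chunkEnds(data) { // toy chunker from the earlier sketch
		piece := data[start:end]
		d := Digest{Hash: fmt.Sprintf("%x", sha256.Sum256(piece)), Size: int64(len(piece))}
		order = append(order, d)
		pieces[d] = piece
		start = end
	}
	return order, pieces
}

// uploadChunked uploads only the chunks the cache is missing, then tells the
// cache how to reconstruct the original blob from them.
func uploadChunked(ctx context.Context, cas chunkWriter, blob Digest, data []byte) error {
	chunks, pieces := chunkDigests(data)                // 1. chunk boundaries and digests
	missing, err := cas.FindMissingChunks(ctx, chunks)  // 2. which chunks are new?
	if err != nil {
		return err
	}
	for _, d := range missing { // 3. upload only the missing chunks
		if err := cas.WriteChunk(ctx, d, pieces[d]); err != nil {
			return err
		}
	}
	return cas.SpliceBlob(ctx, blob, chunks) // 4. store reconstruction metadata
}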

With this model, chunks are stored as normal CAS blobs under their own digests. The reconstruction metadata is keyed by the original large blob digest, so future SplitBlob calls can start from the digest they already know and discover the chunk layout.

This also helps distribute storage more evenly. Instead of treating one very large object as an indivisible cache entry, the cache can store and serve smaller chunks across the CAS like any other blob.

SplitBlob and SpliceBlob flows

SplitBlob is the read-side API; SpliceBlob is the write-side API.

Bazel Combined Cache

Bazel implements CDC in the combined cache, which coordinates remote cache and disk cache reads and writes.

When the remote cache advertises chunking support, Bazel creates chunked upload and download paths. Large blobs above the server-provided threshold use the chunked path; smaller blobs keep using the normal cache path.

One important implementation detail is that Bazel does not need to keep a second copy of every chunk in memory. The output already exists on disk, so the uploader can use the original file as the source for chunk data and stream the needed byte ranges during upload.

Chunk byte ranges instead of chunk copies

The client can keep byte ranges in the original file instead of a second copy of every chunk.
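
A sketch of that detail, again with hypothetical types from the same sketch package (chunkRef is an invented name; the real Bazel code is structured differently): the uploader records each chunk as an offset and length into the output file and streams just that range at upload time.

package cdcdemo

import (
	"context"
	"io"
	"os"
)

// chunkRef records where a chunk lives inside the original output file, so the
// uploader never needs to hold a full second copy of the chunk in memory.
type chunkRef struct {
	digest Digest
	offset int64
	length int64
}

// uploadChunkFromFile streams a single chunk's byte range from the file on disk.
func uploadChunkFromFile(ctx context.Context, cas chunkWriter, path string, ref chunkRef) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// Read only this chunk's range of the file; the full output is never buffered.
	data := make([]byte, ref.length)
	if _, err := io.ReadFull(io.NewSectionReader(f, ref.offset, ref.length), data); err != nil {
		return err
	}
	return cas.WriteChunk(ctx, ref.digest, data)
}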

BuildBuddy Implementation

BuildBuddy implements CDC on the server side and in executors.

Server Side

The server side implements SplitBlob and SpliceBlob. Chunks are stored as normal CAS entries keyed by their chunk digest, while the reconstruction metadata is stored separately under a key derived from the original blob digest. When SpliceBlob is called, BuildBuddy verifies that the chunks exist and that concatenating them produces the original blob digest.
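
A minimal sketch of that verification step, assuming an in-memory chunk lookup for brevity; the real server streams chunks from the CAS rather than buffering them.

package cdcdemo

import (
	"crypto/sha256"
	"fmt"
)

// verifySplice checks that every referenced chunk exists and that their
// concatenation hashes back to the digest of the original blob.
func verifySplice(blob Digest, order []Digest, chunks map[Digest][]byte) error {
	h := sha256.New()
	var total int64
	for _, c := range order {
		data, ok := chunks[c]
		if !ok {
			return fmt.Errorf("missing chunk %s", c.Hash)
		}
		h.Write(data)
		total += int64(len(data))
	}
	if got := fmt.Sprintf("%x", h.Sum(nil)); got != blob.Hash || total != blob.Size {
		return fmt.Errorf("spliced chunks do not reproduce blob %s", blob.Hash)
	}
	return nil
}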

Because this happens behind the cache APIs, BuildBuddy can reduce transfer for large reads and writes while keeping existing unchunked cache paths working. The server-side cache path can skip chunks that already exist, move the chunks that are missing, and transfer those chunks in parallel.

Executors

Executors can upload large action outputs as chunks directly. The executor walks outputs normally, uses the negotiated chunking parameters to compute chunk digests for large files, calls FindMissingBlobs, and uploads only the missing chunks. The uploader can read the needed byte ranges from the original file and upload chunks concurrently, instead of keeping a second full copy in memory.

This means CDC can help on multiple hops: Bazel client to BuildBuddy, executor to BuildBuddy, and internal server-side cache traffic. Native Split/Splice-aware clients get end-to-end chunked transfers, while existing clients can still use the normal cache APIs.

Availability

Bazel support for CDC was introduced in bazelbuild/bazel#28437, and remote cache CDC is available in Bazel 8.7 and 9.1+.

Bazel clients using BuildBuddy can opt in to local client-side upload/download savings with:

bazel build //... --experimental_remote_cache_chunking
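
To enable it for every build, the same flag can also go in the project's .bazelrc:

build --experimental_remote_cache_chunking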

BuildBuddy servers currently have CDC enabled for large files flowing through the server-side cache path. Self-hosted executor users should run BuildBuddy executor v2.261.0 or newer for full CDC benefits. No executor config is required; CDC-eligible execution requests enable it automatically.

Closing

CDC makes remote caching better at what developers actually do all day: make small changes to large codebases that sometimes produce large outputs. Instead of uploading and downloading the same bytes again and again, BuildBuddy and Bazel can now reuse the chunks that did not change, significantly cutting down on cache transfer.

Try it today with Bazel 8.7 or 9.1+ by setting --experimental_remote_cache_chunking on your BuildBuddy cache-enabled Bazel builds.
