Remote Cache CDC: Reusing Bytes

The goal: move the changed bytes, not the whole output.
BuildBuddy's Remote Cache uses Content-Defined Chunking (CDC) to make large build outputs behave more incrementally. When a binary, bundle, package, or archive is mostly unchanged, BuildBuddy can reuse chunks it has already seen instead of re-uploading or re-downloading the entire file.
In our Bazel chunking implementation PR, we observed 40% less data uploaded and a 40% smaller disk cache when benchmarked on BuildBuddy's own repo. To enable client-side CDC with BuildBuddy, use Bazel 8.7 or 9.1+ and pass --experimental_remote_cache_chunking.
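For example, a minimal .bazelrc sketch (the cache endpoint here is just a placeholder, and the chunking flag is experimental and may change between Bazel releases):

# .bazelrc
build --remote_cache=grpcs://remote.buildbuddy.io
build --experimental_remote_cache_chunking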
Setting the Scene
The next frontier for build caching is not just skipping actions. It is skipping bytes.
Build caching has come a long way. Instead of rebuilding the world after every edit, Bazel and remote caching let teams reuse action outputs across machines and CI jobs. In practice, builds have moved from something closer to O(size of repo) toward O(size of change).
But "size of change" can be misleading. What really matters is the size of the transitive actions affected by the edit. A small source change can still ripple into many binaries, packages, bundles, and other large outputs, even when only a small part of each output actually changes.
That invalidation is expected. Build systems should rerun an action when its inputs change. The remote-cache problem is what happens next: the cache sees a new digest and moves the whole blob, even if that blob is mostly the same bytes as the previous version.
Transitive Actions
Linking, bundling, packaging, and archiving are where this shows up most often. They combine many transitive inputs into one output.
That makes them different from actions that operate on a small, direct set of files. A typical compile action compiles one source file against a handful of direct inputs. A transitive action, on the other hand, often consumes the accumulated outputs of many dependencies and produces one final binary, bundle, package, or archive.
In Bazel rules, this often shows up as a rule collecting files through a transitive depset and passing that accumulated set into a single action. For example, a simplified compile action might look like this:
ctx.actions.run(
    inputs = [src] + direct_headers,
    outputs = [obj],
    executable = compiler,
    arguments = ["-c", src.path, "-o", obj.path],
)
A bundling or packaging action often looks more like this:
transitive_inputs = depset(
    direct = direct_files,
    transitive = [dep[MyInfo].files for dep in ctx.attr.deps],
)

ctx.actions.run(
    inputs = transitive_inputs,
    outputs = [bundle],
    executable = bundler,
    arguments = ["--output", bundle.path],
)
That second shape is where small source changes can fan out into large output changes. The source edit might only change a small sequence of bytes in the final output, but the output digest is still new.
Without CDC, the cache treats that as a completely new blob, even when most of the binary, bundle, package, or archive is byte-for-byte identical to the previous version. If many final outputs depend on that changed input, they can all get new digests.
For remote caching, the expensive part is not just that the output is large. It is that the output is large and mostly similar to something the cache already has, but the whole-blob digest is new.
That creates two problems:
- Uploads and downloads move the whole blob, even when only a small part changed.
- Storage keeps another whole blob, even when most bytes are duplicates.
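To make the first problem concrete, here is a toy illustration in plain Python (not tied to Bazel or BuildBuddy): changing a single byte in a large output produces a completely new whole-file digest, so a cache keyed on that digest has no way to know that almost all of the bytes are ones it already stores.

import hashlib
import os

blob = os.urandom(50 * 1024 * 1024)  # stand-in for a large build output
# The same output with exactly one byte flipped.
edited = blob[:1000] + bytes([blob[1000] ^ 0xFF]) + blob[1001:]

print(hashlib.sha256(blob).hexdigest())
print(hashlib.sha256(edited).hexdigest())  # unrelated digest; the cache sees a brand-new blob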
One workaround is to disable remote caching for these actions. That avoids uploading huge outputs when the expected cache hit is not worth the write cost, but it creates a different problem: the action now has to run every time. It can also make the action harder to move to remote execution, because RBE depends on moving action inputs and outputs efficiently.
So the build avoids one expensive cache write, but gives up reuse entirely.
A small source change can invalidate the final transitive action.
Case study: Go tests
A common example is a shared go_library, say foo, that is imported by many other libraries: bar1, bar2, through barN. Each bar library may also have its own go_test.
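In BUILD terms, the shape looks roughly like this (names are illustrative, and the load path may differ depending on how rules_go is set up):

load("@io_bazel_rules_go//go:def.bzl", "go_library", "go_test")

go_library(
    name = "foo",
    srcs = ["foo.go"],
    importpath = "example.com/foo",
)

go_library(
    name = "bar1",
    srcs = ["bar1.go"],
    importpath = "example.com/bar1",
    deps = [":foo"],  # one of many libraries that import foo
)

go_test(
    name = "bar1_test",
    srcs = ["bar1_test.go"],
    embed = [":bar1"],  # linking this test binary pulls in foo's archive
)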
An implementation-only change in foo might only rebuild foo's own GoCompilePkg action. The downstream compile actions can often still hit the cache because Go compilation depends on direct dependency export data, like foo.x, not the full transitive archive graph.
Linking is different. Each go_test needs a test binary, produced by a GoLink action, and that link action consumes the transitive set of Go archives, like foo.a. If foo.a changes, many downstream test binaries can get new digests even when their source and compile actions did not change. Finally, the TestRunner action needs that test binary as an input in order to run it.
That means one small source edit can create many new test binary digests. Those test binaries are often large, and many of them are mostly the same bytes as before. Without CDC, each one is still transferred and stored as a new whole blob.
Treating This as an Output Problem
One option would be to make the actions themselves incremental: incremental linking, runtime linking, smarter bundling, smarter packaging, and so on. But this is usually very difficult and requires extensive changes to the linkers and tools themselves.
And even if we solved that for one tool, we would still need separate solutions for GoLink, C++ linkers, JavaScript bundlers, app packagers, generated archives, and every other action that can produce a large output. That does not scale.
Instead, we can treat this as a generic output problem: these actions create large files, where only a small amount of content is changing. With Content-Defined Chunking (CDC), we can leave the actions themselves untouched, while still getting many of the wins of making those actions incremental.
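As a rough sketch of the idea (toy code, not BuildBuddy's or Bazel's actual chunker), a content-defined chunker picks chunk boundaries from the bytes themselves using a rolling hash, so boundaries depend on content rather than on absolute offsets:

import hashlib

# Toy Gear-style content-defined chunker. Parameters and hash are illustrative.
MIN_CHUNK = 2 * 1024
MAX_CHUNK = 64 * 1024
MASK = (1 << 13) - 1  # targets an average chunk size of roughly 8 KiB

# Fixed pseudo-random table mapping each byte value to a 32-bit integer.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big") for b in range(256)]

def chunks(data):
    """Yield chunks whose boundaries depend on content, not on offsets."""
    start, h = 0, 0
    for i, b in enumerate(data):
        # Older bytes shift out of the 32-bit window, so the hash only "sees"
        # recent bytes; identical content re-synchronizes boundaries after an edit.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def chunk_digests(data):
    # A cache keyed by chunk digest only needs to transfer and store digests
    # it has not seen before; unchanged chunks are simply reused.
    return [hashlib.sha256(c).hexdigest() for c in chunks(data)]

With this kind of chunking, editing a few bytes in the middle of a large binary only changes the digest of the chunk containing the edit (and perhaps an immediate neighbor). Every other chunk keeps its previous digest, which is what lets the cache move and store only the changed bytes.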