Prometheus Metrics
BuildBuddy exposes Prometheus metrics that allow monitoring the four golden signals: latency, traffic, errors, and saturation.
Prometheus metrics are exposed under the path /metrics on port 9090 by default.
To view these metrics in a live-updating dashboard, we recommend using a tool like Grafana.
Invocation build event metrics
All invocation metrics are recorded at the end of each invocation.
buildbuddy_invocation_count
(Counter)
The total number of invocations whose logs were uploaded to BuildBuddy.
Labels
- invocation_status: Invocation status: success, failure, disconnected, or unknown.
- bazel_exit_code: Exit code of a completed bazel command.
- bazel_command: Command provided to the Bazel daemon: run, test, build, coverage, mobile-install, ...
Examples
# Number of invocations per second by invocation status
sum by (invocation_status) (rate(buildbuddy_invocation_count[5m]))
# Invocation success rate
sum(rate(buildbuddy_invocation_count{invocation_status="success"}[5m]))
/
sum(rate(buildbuddy_invocation_count[5m]))
buildbuddy_invocation_duration_usec
(Histogram)
The total duration of each invocation, in microseconds.
Labels
- invocation_status: Invocation status: success, failure, disconnected, or unknown.
- bazel_command: Command provided to the Bazel daemon: run, test, build, coverage, mobile-install, ...
Examples
# Median invocation duration in the past 5 minutes
histogram_quantile(
0.5,
sum(rate(buildbuddy_invocation_duration_usec_bucket[5m])) by (le)
)
buildbuddy_invocation_build_event_count
(Counter)
Number of build events uploaded to BuildBuddy.
Labels
- status: Status code as defined by grpc/codes. This is a numeric value; any non-zero code indicates an error.
Examples
# Build events uploaded per second
sum(rate(buildbuddy_invocation_build_event_count[5m]))
# Approximate error rate of build event upload handler
sum(rate(buildbuddy_invocation_build_event_count{status!="0"}[5m]))
/
sum(rate(buildbuddy_invocation_build_event_count[5m]))
buildbuddy_invocation_stats_recorder_workers
(Gauge)
Number of invocation stats recorder workers currently running.
buildbuddy_invocation_stats_recorder_duration_usec
(Histogram)
How long it took to finalize an invocation's stats, in microseconds.
This includes the time required to wait for all BuildBuddy apps to flush their local metrics to Redis (if applicable) and then record the metrics to the DB.
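As an illustrative sketch, this histogram could be queried for tail latency in the same way as the invocation duration histogram above. It assumes the standard _bucket series is exported for this metric; the 0.95 quantile and the 5-minute window are arbitrary choices.

```promql
# p95 time to finalize an invocation's stats over the past 5 minutes
histogram_quantile(
  0.95,
  sum(rate(buildbuddy_invocation_stats_recorder_duration_usec_bucket[5m])) by (le)
)
```

The result is in microseconds; divide it by 1e6 to read it in seconds.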
buildbuddy_invocation_webhook_invocation_lookup_workers
(Gauge)
Number of webhook invocation lookup workers currently running.
buildbuddy_invocation_webhook_invocation_lookup_duration_usec
(Histogram)
How long it took to look up an invocation before posting to the webhook, in microseconds.
buildbuddy_invocation_webhook_notify_workers
(Gauge)
Number of webhook notify workers currently running.
buildbuddy_invocation_webhook_notify_duration_usec
(Histogram)
How long it took to post an invocation proto to the webhook, in microseconds.
Remote cache metrics
NOTE: Cache metrics are recorded at the end of each invocation, so they lag behind real time and should be treated as approximate real-time signals.
buildbuddy_remote_cache_events
(Counter)
Number of cache events handled.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
- cache_event_type: Cache event type: hit, miss, or upload.
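A common use of this counter is a cache hit ratio. The query below is a sketch built only from the label values listed above; it assumes that hit and miss events together account for all cache lookups, and it leaves upload events out of the denominator.

```promql
# Approximate action cache hit ratio over the past 5 minutes
sum(rate(buildbuddy_remote_cache_events{cache_type="action", cache_event_type="hit"}[5m]))
/
sum(rate(buildbuddy_remote_cache_events{cache_type="action", cache_event_type=~"hit|miss"}[5m]))
```

Changing cache_type to "cas" gives the same ratio for the content-addressable store.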
buildbuddy_remote_cache_download_size_bytes
(Histogram)
Number of bytes downloaded from the remote cache in each download.
Use the _sum suffix to get the total downloaded bytes and the _count suffix to get the number of downloaded files.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
- server_name: Name of the server that handles a client request, such as "byte_stream_server" or "cas_server".
Examples
# Cache download rate (bytes per second)
sum(rate(buildbuddy_remote_cache_download_size_bytes_sum[5m]))
buildbuddy_remote_cache_download_duration_usec
(Histogram)
Download duration for each file downloaded from the remote cache, in microseconds.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
Examples
# Median download duration for content-addressable store (CAS)
histogram_quantile(
0.5,
sum(rate(buildbuddy_remote_cache_download_duration_usec_bucket{cache_type="cas"}[5m])) by (le)
)
buildbuddy_remote_cache_upload_size_bytes
(Histogram)
Number of bytes uploaded to the remote cache in each upload.
Use the _sum suffix to get the total uploaded bytes and the _count suffix to get the number of uploaded files.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
- server_name: Name of the server that handles a client request, such as "byte_stream_server" or "cas_server".
Examples
# Cache upload rate (bytes per second)
sum(rate(buildbuddy_remote_cache_upload_size_bytes_sum[5m]))
buildbuddy_remote_cache_upload_duration_usec
(Histogram)
Upload duration for each file uploaded to the remote cache, in microseconds.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
Examples
# Median upload duration for content-addressable store (CAS)
histogram_quantile(
0.5,
sum(rate(buildbuddy_remote_cache_upload_duration_usec_bucket{cache_type="cas"}[5m])) by (le)
)
buildbuddy_remote_cache_disk_cache_last_eviction_age_usec
(Gauge)
The age of the item most recently evicted from the cache, in microseconds.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_eviction_age_msec
(Histogram)
Age of items evicted from the cache, in milliseconds.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_num_evictions
(Counter)
Number of items evicted.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
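Because this is a counter, it is most useful as a rate. A minimal sketch, using only the labels listed above and an arbitrary 5-minute window:

```promql
# Items evicted per second, by disk cache partition
sum by (partition_id, cache_name) (
  rate(buildbuddy_remote_cache_disk_cache_num_evictions[5m])
)
```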
buildbuddy_remote_cache_disk_cache_partition_size_bytes_evicted
(Counter)
Number of bytes evicted from the partition.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_partition_size_bytes
(Gauge)
Number of bytes in the partition.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_partition_capacity_bytes
(Gauge)
Capacity of the partition, in bytes.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
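Taken together, the partition size and capacity gauges can indicate how full each partition is. The query below is a sketch that assumes, as the metric name suggests, that the capacity gauge reports the partition's maximum size, and that both gauges carry identical label sets so that Prometheus can match them one-to-one. If the label sets differ, an explicit on(...) matching clause would be needed.

```promql
# Fraction of each partition's capacity currently in use (0 to 1)
buildbuddy_remote_cache_disk_cache_partition_size_bytes
/
buildbuddy_remote_cache_disk_cache_partition_capacity_bytes
```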
buildbuddy_remote_cache_disk_cache_partition_num_items
(Gauge)
Number of items in the partition.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
buildbuddy_remote_cache_disk_cache_duplicate_writes
(Counter)
Number of writes for digests that already exist.
Labels
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_added_file_size_bytes
(Histogram)
Size of artifacts added to the file cache, in bytes.
Labels
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_filesystem_total_bytes
(Gauge)
Total size of the underlying filesystem.
Labels
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_filesystem_avail_bytes
(Gauge)
Available bytes in the underlying filesystem.
Labels
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
Examples
# Total number of duplicate writes.
sum(buildbuddy_remote_cache_disk_cache_duplicate_writes)
buildbuddy_remote_cache_disk_cache_duplicate_writes_bytes
(Counter)
Number of bytes written that already existed in the cache.
Labels
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_migration_not_found_error_count
(Counter)
Number of not found errors from the destination cache during a cache migration.
Labels
- type: Describes the type of cache request
buildbuddy_remote_cache_migration_double_read_hit_count
(Counter)
Number of double reads where the source and destination caches hold the same digests during a cache migration.
Labels
- type: Describes the type of cache request
buildbuddy_remote_cache_migration_copy_chan_size
(Gauge)
Number of digests queued to be copied during a cache migration.
buildbuddy_remote_cache_migration_bytes_copied
(Counter)
Number of bytes copied from the source to destination cache during a cache migration.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
buildbuddy_remote_cache_migration_blobs_copied
(Counter)
Number of blobs copied from the source to destination cache during a cache migration.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
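The two migration copy counters can be combined to follow migration progress. The queries below are a sketch using only the metric and label names documented above; the 5-minute window is an arbitrary choice.

```promql
# Bytes copied per second during a cache migration, by cache type
sum by (cache_type) (rate(buildbuddy_remote_cache_migration_bytes_copied[5m]))

# Average size in bytes of the blobs copied during the same window
sum(rate(buildbuddy_remote_cache_migration_bytes_copied[5m]))
/
sum(rate(buildbuddy_remote_cache_migration_blobs_copied[5m]))
```

The second query returns no usable value while no blobs are being copied, because the denominator is then zero.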
buildbuddy_remote_cache_tree_cache_lookup_count
(Counter)
Total number of TreeCache lookups.
Labels
- status: The TreeCache status: hit/miss/invalid_entry.
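A TreeCache hit ratio can be derived from the status label. This sketch assumes the label takes exactly the values listed above (hit, miss, invalid_entry), so the unfiltered denominator counts all lookups.

```promql
# Fraction of TreeCache lookups that were hits over the past 5 minutes
sum(rate(buildbuddy_remote_cache_tree_cache_lookup_count{status="hit"}[5m]))
/
sum(rate(buildbuddy_remote_cache_tree_cache_lookup_count[5m]))
```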
buildbuddy_remote_cache_tree_cache_set_count
(Counter)
Total number of TreeCache sets.
Remote execution metrics
buildbuddy_remote_execution_count
(Counter)
Number of actions executed remotely.
This only includes actions which reached the execution phase. If an action fails before execution (for example, if it fails authentication) then this metric is not incremented.
Labels
- exit_code: Process exit code of an executed action.
- status: Status code as defined by grpc/codes in human-readable format, such as "OK" or "NotFound".
- isolation: Effective workload isolation type used for an executed task, such as "docker", "podman", "firecracker", or "none".
Examples
# Total number of actions executed per second
sum(rate(buildbuddy_remote_execution_count[5m]))
buildbuddy_remote_execution_tasks_started_count
(Counter)
Number of tasks started remotely, but not necessarily completed.
Includes retry attempts of the same task.
buildbuddy_remote_execution_executed_action_metadata_durations_usec
(Histogram)
Time spent in each stage of action execution, in microseconds.
Queries should filter or group by the stage label, taking care not to aggregate different stages.
Labels
- stage: Executed action stage. Action execution is split into stages corresponding to the timestamps defined in ExecutedActionMetadata: queued, input_fetch, execution, and output_upload. An additional stage, worker, includes all stages during which a worker is handling the action, which is all stages except the queued stage.
- group_id: Group (organization) ID associated with the request.
Examples
# Median duration of all command stages
histogram_quantile(
0.5,
sum(rate(buildbuddy_remote_execution_executed_action_metadata_durations_usec_bucket[5m])) by (le, stage)
)
# p90 duration of just the command execution stage
histogram_quantile(
0.9,
sum(rate(buildbuddy_remote_execution_executed_action_metadata_durations_usec_bucket{stage="execution"}[5m])) by (le)
)
buildbuddy_remote_execution_task_size_read_requests
(Counter)
Number of read requests to the task sizer, which estimates action resource usage based on historical execution stats.
Labels
- status: Status of the task size read request: hit, miss, or error.
- isolation: Effective workload isolation type used for an executed task, such as "docker", "podman", "firecracker", or "none".
- os: OS associated with the request.
- arch: CPU architecture associated with the request.
- group_id: Group (organization) ID associated with the request.
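One way to judge how often the task sizer has historical data available is the share of read requests that hit. A sketch using the status values listed above, with an arbitrary 5-minute window:

```promql
# Fraction of task sizer read requests that were hits, by isolation type
sum by (isolation) (rate(buildbuddy_remote_execution_task_size_read_requests{status="hit"}[5m]))
/
sum by (isolation) (rate(buildbuddy_remote_execution_task_size_read_requests[5m]))
```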
buildbuddy_remote_execution_task_size_write_requests
(Counter)
Number of write requests to the task sizer, which estimates action resource usage based on historical execution stats.
Labels
- status: Status of the task size write request: ok, missing_stats, or error.
- isolation: Effective workload isolation type used for an executed task, such as "docker", "podman", "firecracker", or "none".
- os: OS associated with the request.
- arch: CPU architecture associated with the request.
- group_id: Group (organization) ID associated with the request.
buildbuddy_remote_execution_task_size_prediction_duration_usec
(Histogram)
Task size prediction model request duration in microseconds.
Labels
- status: Status code as defined by grpc/codes in human-readable format, such as "OK" or "NotFound".
buildbuddy_remote_execution_waiting_execution_result
(Gauge)
Number of execution requests for which the client is actively waiting for results.
Labels
- group_id: Group (organization) ID associated with the request.
Examples
# Total number of execution requests with client waiting for result.
sum(buildbuddy_remote_execution_waiting_execution_result)
buildbuddy_remote_execution_requests
(Counter)
Number of execution requests received.
Labels
- group_id: Group (organization) ID associated with the request.
- os: OS associated with the request.
- arch: CPU architecture associated with the request.
Examples
# Rate of new execution requests by OS/Arch.
sum(rate(buildbuddy_remote_execution_requests[1m])) by (os, arch)
buildbuddy_remote_execution_executor_registration_count
(Counter)
Number of executor registrations on the scheduler.
Labels
- version: Binary version. Example: v2.0.0.
buildbuddy_remote_execution_merged_actions
(Counter)
Number of identical execution requests that have been merged.
Labels
- group_id: Group (organization) ID associated with the request.
Examples
# Rate of merged actions by group.
sum(rate(buildbuddy_remote_execution_merged_actions[1m])) by (group_id)
buildbuddy_remote_execution_queue_length
(Gauge)
Number of actions currently waiting in the executor queue.
Labels
- group_id: Group (organization) ID associated with the request.
Examples
# Median queue length across all executors
quantile(0.5, buildbuddy_remote_execution_queue_length)
buildbuddy_remote_execution_tasks_executing
(Gauge)
Number of tasks currently being executed by the executor.