A small cross-platform benchmark collector for performance-testing fleets (pools of Taskcluster worker hosts that run Firefox perf tests), plus a Python runner that wraps it for use on those hosts.
The collector currently ships two workloads — CPU (prime sieve, single-
and multi-threaded) and ADB I/O (timed adb push/pull against an
attached Android device) — alongside an inspect mode for host metadata.
Both workloads emit the same envelope shape so a single analysis pipeline
consumes them.
Fleetbench produces raw per-iteration timings and host metadata as versioned JSON. It does not score hosts, compare across hardware classes, or maintain fleet-wide state — that work belongs to a downstream analysis layer fed from the collected envelope files.

- collector/: Rust binary (fleetbench). Single-host-aware, emits one JSON object per invocation on stdout. No filesystem opinions.
- runner/: Python package (fleetbench-run). Wraps the collector, self-throttles, writes envelope files to disk.
- docs/fleetbench_design_v2.md: design doc. Start here.
- analysis_notes.md: guidance for the downstream analysis layer (use median, drop iter 0, etc.).

| Component | Linux | Windows | macOS | Android |
|---|---|---|---|---|
| Collector | shipped | binary cross-compiles, env sampling fields are null pending implementation | shipped (env block intentionally null; no /proc on Darwin) | shipped (env block populated; same /proc/stat + /proc/loadavg path as Linux) |
| Runner | shipped | deferred pending CPython availability question | works (dev) | not applicable — Android deploy model is different |
The collector is a single binary (fleetbench) with three peer subcommands:
| Subcommand | What it does | Where it runs | Output section |
|---|---|---|---|
| inspect | Host + CPU metadata only, no workload | Any host | (just host/cpu, no results) |
| cpu | Prime-sieve workload (1t + MT), optional time-bounded torture mode with per-core frequency sampling | Any host (Linux/Windows/macOS/Android) | results.prime_sieve_1t / results.prime_sieve_mt (+ frequency_series in --duration mode) |
| adb | Times adb push / adb pull against an attached Android device; pre-generated random payloads, SHA256-verified per iteration | Linux/macOS host that has adb and a phone attached, not the phone itself | adb_results.iterations |

Every invocation emits a single JSON envelope with the same top-level shape
(schema_version, host, environment, plus suite-specific *_config,
*_env, and *_results siblings). Downstream tools branch on which
*_config block is present.
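
The branching rule above could look roughly like this in a downstream tool. This is a minimal sketch, not the analysis layer's actual code: the helper name suite_of is hypothetical, and adb_config in the usage note is an assumed instance of the *_config naming pattern rather than a confirmed key.

```python
from typing import Optional


def suite_of(envelope: dict) -> Optional[str]:
    """Return the suite name implied by the envelope's *_config block.

    Returns None when no *_config block is present (for example an
    inspect-style envelope). Raises ValueError if several are present,
    since the branching rule would then be ambiguous.
    """
    suffix = "_config"
    suites = sorted(key[: -len(suffix)] for key in envelope if key.endswith(suffix))
    if len(suites) > 1:
        raise ValueError(f"ambiguous envelope, found configs for: {suites}")
    return suites[0] if suites else None
```

Under those assumptions, suite_of({"adb_config": {}, "adb_results": {}}) returns "adb", and an envelope carrying only schema_version, host, and environment returns None.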
fleetbench inspect # human-readable
fleetbench inspect --json # envelope with host/cpu populated, no workload

Useful as a quick "what is this host?" check, and as a smoke test that the binary runs on the target at all before kicking off a workload.
The default fleet workload: a prime-sieve up to prime_limit, run both
single-threaded and across all cores. Calibrated for per-iteration timings
above the noise floor on slow-x86 fleet hardware.
fleetbench cpu --json # --mode normal, all logical CPUs
fleetbench cpu --mode quick --json # CI / dev cycles
fleetbench cpu --mode long --json # fast hardware
fleetbench cpu --mode quick --duration 10m --json # torture / throttle hunt

normal (pi(10⁸), 5 iterations) targets ~150 ms per iteration on slow-x86 fleet
hosts (Xeon E3-class), which is where signal quality matters most. On much
faster hardware — M-class Macs, modern workstations — per-iteration timing
drops to ~90 ms, which is below the ~100 ms noise floor for tight outlier
detection. Use --mode long (pi(10⁹), 3 iterations) on hardware that fast
to keep iterations comfortably above the noise floor. Slow phones and old
fleet hardware are well-served by normal.
--duration <30s|10m|1h> switches the cpu subcommand into a time-bounded
sustained-load run intended for thermal-throttle investigations — not the
default fleet cadence. The MT sieve loops until the wall-clock duration
elapses; the 1t workload is skipped so all cores stay hot continuously. A
background sampler captures per-core CPU frequency at ~1Hz into the envelope
as frequency_series, which is the direct signal for thermal throttling
(boost-clock samples decaying toward base-clock over the run).
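
One way to turn that decay into a number is to compare peak clocks early and late in the run. The sketch below is illustrative only: it assumes frequency_series can be read as a list of samples, each a list of per-core MHz values, which is a guess at the layout and not the documented schema. The function name and window size are hypothetical.

```python
from statistics import median


def frequency_decay(series, window=10):
    """Ratio of late-run to early-run median peak frequency.

    `series` is ASSUMED to be a list of samples, each a list of per-core
    MHz readings. A ratio well below 1.0 suggests clocks falling from
    boost toward base over the run; a ratio near 1.0 suggests no decay.
    """
    # Highest core clock per sample: the boost signal we care about.
    peaks = [max(sample) for sample in series if sample]
    if len(peaks) < 2 * window:
        raise ValueError("not enough samples for two non-overlapping windows")
    early = median(peaks[:window])
    late = median(peaks[-window:])
    return late / early
```

What ratio counts as throttling is a judgment call for the analysis layer; this sketch deliberately does not pick a threshold.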
How --mode interacts with --duration. This trips people up: in
duration mode, --mode picks only the per-iteration size (prime_limit).
The preset's iteration count is ignored — total iterations are whatever
completes before the deadline. Reading --mode long --duration 10m as
"the longest mode" produces a handful of multi-second iterations, not a
denser long run.

| --mode (with --duration) | per-iteration time on a fast NUC | iterations in 10 min |
|---|---|---|
| quick (pi(10⁷)) | ~15 ms | ~40,000 |
| normal (pi(10⁸)) | ~150 ms | ~4,000 |
| long (pi(10⁹)) | ~1.5 s | ~400 |

For torture runs, --mode quick --duration 10m is the natural pairing — it
gives a dense per-iteration time series alongside the 1Hz frequency_series.
--mode long still works (run_mt_until guarantees at least one iteration)
but iteration-time drift becomes a coarse signal; frequency_series carries
the throttle evidence either way.
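
The iteration counts in the table follow from dividing the duration by the per-iteration time. A small sketch of that arithmetic, assuming the duration grammar is a whole number followed by s, m, or h (inferred from the `<30s|10m|1h>` hint, not taken from the collector's parser); both function names are hypothetical.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Parse strings like '30s', '10m', '1h' into whole seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def estimated_iterations(duration: str, per_iteration_ms: int) -> int:
    """Whole iterations that fit before the deadline, never fewer than one.

    The floor of one mirrors the guarantee that a duration run completes
    at least one iteration even when a single iteration exceeds the deadline.
    """
    total_ms = parse_duration(duration) * 1000
    return max(1, total_ms // per_iteration_ms)
```

With the table's figures, estimated_iterations("10m", 15) gives 40000 and estimated_iterations("10m", 1500) gives 400.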
For the full workflow — fetching the release binary, running a torture
test, and reading the output to decide whether a host is throttling — see
docs/detecting_thermal_throttling.md.
fleetbench adb times adb push and adb pull against an attached Android
device. It runs on the Linux Docker host where adb lives, not on the device
itself — the goal is to characterize USB/adb behavior (the path raptor sees
when staging APKs and test files), and to debug "why is provisioning slow
today?" style problems across vendors (e.g. bitbar vs LambdaTest).
fleetbench adb --json # all defaults
fleetbench adb --serial <id> --json # multi-device host
fleetbench adb --sizes 25B,1M --iterations 25B=50,1M=20 --json
fleetbench adb --remote-path /sdcard/Download --json # reproduce raptor's path

Operational model:
- One invocation, one device. Contention is observed by running many invocations concurrently at the Taskcluster layer, which matches how real tests behave. There is no in-collector --parallel mode.
- Target selection. With one device attached, no flag is needed. With multiple, pass --serial; otherwise the run fails with multiple_devices.
- Remote path. Defaults to /data/local/tmp/ to avoid the FUSE layer on /sdcard for a cleaner USB/adb signal. Use --remote-path /sdcard/Download when the goal is to reproduce raptor's path exactly.
- Payloads. For each size, N unique random files are generated up front (xorshift64 fill) so the kernel page cache can't quietly accelerate later iterations. Pre-generation happens before the timed section.
- Verification. Push is checked via adb shell sha256sum; pull is checked by hashing the file locally. A failed hash sets sha256_ok = false on that iteration and exits non-zero (exit 2, correctness failure).
- Sizes & iterations. Defaults emphasize the 25-byte point (where vendor variance shows up; that workload is dominated by command/setup overhead, not bytes on the wire), then progressively larger transfers:

| size | default iterations | what it measures |
|---|---|---|
| 25B | 200 | adb command/setup latency (no real bytes on wire) |
| 1M | 100 | small-transfer steady state |
| 10M | 30 | mid-transfer steady state |
| 100M | 10 | bulk-transfer USB throughput ceiling |
Override iterations per size via --iterations 25B=50,1M=20,....
A full default run does ~720 timed transfers and takes 10-30 minutes on a real device (longer on slow USB hubs). For a quick smoke test:
fleetbench adb --iterations 25B=5,1M=2,10M=2,100M=1 --json

- Output. Per-iteration timings are emitted raw: no median/IQR/summary. The distribution is the signal; the mean often is not. (In a 100-retrigger bitbar-vs-LT comparison, LT's mean was lower but its distribution width was 4-5× wider; that's the kind of thing this subcommand surfaces.)
- Env capture. adb --version is recorded in adb_env, and on Linux hosts the full lsusb -t topology is captured for hub-path correlation across concurrent invocations.
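
The payload and verification steps in the list above can be illustrated in a few lines. This is a sketch under stated assumptions, not the collector's Rust code: the 13/7/17 shift triple is one common xorshift64 variant and may not be the one the collector uses, and the per-iteration seeding scheme and all function names are hypothetical.

```python
import hashlib

MASK64 = (1 << 64) - 1


def xorshift64_bytes(seed: int, size: int) -> bytes:
    """Fill `size` bytes from a xorshift64 stream (ASSUMED 13/7/17 shifts)."""
    state = seed & MASK64
    if state == 0:
        raise ValueError("xorshift64 seed must be non-zero")
    out = bytearray()
    while len(out) < size:
        state ^= (state << 13) & MASK64
        state ^= state >> 7
        state ^= (state << 17) & MASK64
        out += state.to_bytes(8, "little")
    return bytes(out[:size])


def make_payloads(size: int, iterations: int, base_seed: int = 1) -> list:
    """One distinct payload per iteration, built before any timing starts."""
    return [xorshift64_bytes(base_seed + i, size) for i in range(iterations)]


def sha256_ok(payload: bytes, reported_hex: str) -> bool:
    """Compare the local digest with a hex digest reported by the other side."""
    return hashlib.sha256(payload).hexdigest() == reported_hex.strip().lower()
```

Giving every iteration its own payload is what keeps a cached copy of an earlier file from shortening a later transfer; the digest comparison is the per-iteration correctness check.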
cpu:
- Linux: smoke-tested on real fleet hosts (Xeon E3-1585L v5).
- macOS: dev box (Apple Silicon M4 Pro); pi(10⁹) 1t in ~840 ms, mt in ~118 ms across 14 cores.
- Android: Pixel 10 Pro via adb push. See docs/analysis_notes.md for Android-specific behavior the analysis layer needs to know about (governor ramp, big.LITTLE + thermal throttling, non-zero idle load averages).
adb:
- macOS + real phone: dev box (Apple Silicon M4 Pro) with a Pixel 10 Pro over USB; 21/21 iterations passed SHA256 verification across 25B / 1M / 10M / 100M. 25B transfers ran ~25-46 ms (pure adb command/setup overhead), 100M transfers hit ~34 MB/s push and ~39 MB/s pull (pull consistently faster — known adb asymmetry).
- Linux + real phone: bitbar/LT-style Docker host validation is environmental, not a code path; the Linux-only env capture (/proc/stat, /proc/loadavg, lsusb -t) is the same code that ships in cpu and is exercised by that command's Linux fleet runs.

- cpu.frequency_mhz is null on macOS: Apple Silicon doesn't expose a single meaningful peak frequency and sysinfo's value is unreliable, so we deliberately drop it rather than emit a misleading number.
- cpu.brand is null on Android (sysinfo doesn't parse the SoC name from /proc/cpuinfo on ARM); workaround if needed: parse it directly.
- adb_env.lsusb_topology is only captured on Linux hosts (no lsusb on macOS/Windows).

cd collector
cargo build --release # native build for dev
./build # build all four (linux + windows + mac + android)
./build --platform linux # just the linux musl binary
./build --platform windows # just the windows .exe
./build --platform mac # just the mac host-arch binary
./build --platform android # aarch64 Android (requires NDK)

./build produces:
- target/x86_64-unknown-linux-musl/release/fleetbench (~1.1 MB, static, runs on any modern Linux including Ubuntu 18.04)
- target/x86_64-pc-windows-gnu/release/fleetbench.exe (~1.0 MB)
- target/<host-arch>-apple-darwin/release/fleetbench (~1.1 MB)
- target/aarch64-linux-android/release/fleetbench

Every binary embeds version + git SHA as a tagged sentinel string. Three ways to read it, in order of effort:
# 1. From any machine (Mac, Linux), even for a Windows .exe:
strings -a fleetbench[.exe] | grep FLEETBENCH_BUILD
# FLEETBENCH_BUILD=0.1.0+3eb69d100e10
# (suffix "-dirty" appears if the build had uncommitted tracked changes)
# 2. Run the binary itself:
fleetbench --version
# fleetbench 0.1.0 (3eb69d100e10)
# 3. Look at any envelope it produced: collector_git_sha is in the JSON.

When sharing a build, paste the FLEETBENCH_BUILD=... line so the recipient
can confirm they're running what you sent.
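
Where strings is not available, the same sentinel can be pulled out of the raw bytes. A minimal sketch based on the FLEETBENCH_BUILD=0.1.0+3eb69d100e10 example above; it assumes the "-dirty" suffix directly follows the SHA, and the helper and type names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# <version>+<git sha>, optionally followed by "-dirty" (placement ASSUMED).
SENTINEL = re.compile(rb"FLEETBENCH_BUILD=([0-9A-Za-z.]+)\+([0-9a-f]+)(-dirty)?")


class BuildInfo(NamedTuple):
    version: str
    git_sha: str
    dirty: bool


def read_build_info(blob: bytes) -> Optional[BuildInfo]:
    """Return the first FLEETBENCH_BUILD sentinel found in raw binary bytes."""
    match = SENTINEL.search(blob)
    if match is None:
        return None
    version, sha, dirty = match.groups()
    return BuildInfo(version.decode(), sha.decode(), dirty is not None)
```

Typical use would be read_build_info(Path("fleetbench").read_bytes()), which works the same way for the Windows .exe since it only scans bytes.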
Linux and Windows builds cross-compile via cargo-zigbuild; the Mac build
uses the native Apple toolchain; the Android build uses cargo-ndk.
Tooling: brew install zig, cargo install cargo-zigbuild cargo-ndk,
and the rustup targets:
rustup target add x86_64-unknown-linux-musl x86_64-pc-windows-gnu \
aarch64-apple-darwin aarch64-linux-android

Android additionally needs the NDK. With Homebrew:
brew install --cask android-ndk
export ANDROID_NDK_HOME="$(brew --prefix)/share/android-ndk"

Add the export to your shell rc so it persists. Android Studio's SDK
Manager also works; in that case ANDROID_NDK_HOME points at the SDK's
ndk/<version>/ directory instead.
cd runner
uv sync # creates .venv, installs deps including pytest
uv run pytest -q # 98 tests
uv run fleetbench-run --help

collector/smoke builds the binary, scps it to a target host, runs a
sequence of validation checks, and prints a per-run timing table plus
aggregate iter-0/iter-1+ distributions.
cd collector
./smoke <linux-host> --runs 5 --mode normal
./smoke <windows-host> --platform windows --runs 3 --mode normal

The smoke does:
- cargo zigbuild for the target platform.
- scp the binary to the host's home dir.
- gwhc --json activity check (Linux only; skipped silently elsewhere).
- inspect for host/CPU metadata.
- N runs of cpu --json with full schema validation per envelope.
- Negative test: --threads 0 --json must produce a failure envelope and exit 1.

If gwhc reports a non-IDLE state, smoke exits 0 with a summary rather than
running benchmarks against a contaminated baseline.
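
The iter-0 / iter-1+ split that smoke reports, and the "use median, drop iter 0" guidance for the analysis layer, amount to a small transformation. The sketch below assumes each run has already been reduced to a list of per-iteration timings in milliseconds, in iteration order; that input shape and the function names are hypothetical, not the envelope schema.

```python
from statistics import median


def split_warmup(runs):
    """Separate first-iteration timings from the rest across several runs.

    `runs` is a list of per-run timing lists (ms), in iteration order.
    Returns (iter0_timings, later_timings).
    """
    iter0 = [run[0] for run in runs if run]
    later = [t for run in runs for t in run[1:]]
    return iter0, later


def steady_median(runs):
    """Median of all timings after dropping iteration 0 of every run."""
    _, later = split_warmup(runs)
    if not later:
        raise ValueError("need at least one run with two or more iterations")
    return median(later)
```

Keeping iteration 0 in its own bucket rather than discarding it outright leaves the warm-up cost available for inspection while the median stays unaffected by it.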
./smoke does not yet wire Android. Use adb directly:
cd collector
./build --platform android
adb push target/aarch64-linux-android/release/fleetbench /data/local/tmp/fleetbench
adb shell chmod 755 /data/local/tmp/fleetbench
adb shell /data/local/tmp/fleetbench inspect
adb shell /data/local/tmp/fleetbench cpu --mode quick --json

/data/local/tmp/ is the standard "anyone can push and execute" path on
Android. The collector emits the same v3 envelope as on Linux, with
host.os_family = "android" and a populated environment block from the
same /proc/stat + /proc/loadavg reads. adb shell exit codes are
historically unreliable; trust the JSON's status field, not $?.
Invoked by the worker-startup wrapper before the Taskcluster worker boots.
Self-throttles based on the newest envelope timestamp in the results
directory (--min-interval, default 24h). Pre-flights the host via gwhc
on Linux and skips runs against non-IDLE hosts. Writes one envelope file per
run, success or failure, via .partial + atomic rename. See
the design doc for the full contract.
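
The throttle and the atomic write described above might look roughly like the following. This is a sketch and not the runner's code: it assumes the "newest envelope timestamp" can be approximated by file modification time and that envelopes are *.json files, and all function names are hypothetical. The design doc remains the authority on the real contract.

```python
import json
import os
import time
from pathlib import Path
from typing import Optional


def newest_envelope_age(results_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Seconds since the newest *.json file was modified, or None if none exist."""
    mtimes = [p.stat().st_mtime for p in results_dir.glob("*.json")]
    if not mtimes:
        return None
    return (time.time() if now is None else now) - max(mtimes)


def should_run(results_dir: Path, min_interval_s: float) -> bool:
    """Run when no envelope exists yet or the newest one is old enough."""
    age = newest_envelope_age(results_dir)
    return age is None or age >= min_interval_s


def write_envelope(results_dir: Path, name: str, envelope: dict) -> Path:
    """Write <name>.partial, flush it to disk, then rename it into place."""
    final = results_dir / name
    partial = results_dir / (name + ".partial")
    with open(partial, "w", encoding="utf-8") as fh:
        json.dump(envelope, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace is atomic when source and destination share a filesystem,
    # so readers never observe a half-written envelope under the final name.
    os.replace(partial, final)
    return final
```

Because the .partial name does not match the *.json glob, an interrupted write never counts toward the throttle.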
fleetbench-run \
--results-dir /var/lib/fleetbench \
--mode normal \
--collector-binary /usr/local/bin/fleetbench \
--min-interval 24h

A possible companion model is to run the collector inside dedicated Taskcluster jobs targeted at specific worker pools, with a small controller tool that enqueues the jobs, records their IDs, polls for completion, and pulls the envelope artifacts back. Useful for targeted sweeps ("benchmark every gecko_t_linux_talos host now, before/after this kernel change") rather than continuous drift detection.
Tradeoffs noted but not yet committed work:
- Queue contention. Benchmark jobs compete with real test traffic for worker time; on a busy queue, hourly or even daily fleet sweeps could end up waiting behind production work. The boot-throttle model sidesteps this by slipping into a window where the worker is not taking tasks.
- Per-job overhead. TC task scheduling, image pull, and log shipping for what's a ~5 second benchmark is wasteful compared to direct invocation.
- Visibility cost. Every benchmark becomes a TC entity that shows up in task dashboards.
A TC-driven invocation does not require a new runner — the existing
fleetbench-run would just need a taskcluster value added to its
--trigger enum and invocation from inside the task. Filing as a real
beads task is deferred until someone needs the controlled-sweep capability.
Binaries are intended to ship via GitHub releases, tagged per version. This is the primary distribution channel because:
- Any Taskcluster task on any worker (including bitbar Android phones where Mozilla does not own the host OS layer) can fetch a release asset directly.
- Releases are immutable per tag, so cross-version benchmark comparisons reference a stable build.
- TC's fetches mechanism caches external URLs automatically.
Release asset naming follows a templatable convention so task definitions can be written once and parameterized by version:
fleetbench-<version>-linux-x86_64
fleetbench-<version>-windows-x86_64.exe
fleetbench-<version>-macos-aarch64
fleetbench-<version>-android-aarch64
SHA256SUMS
A SHA256SUMS file alongside the binaries enables fetch-time integrity
verification (sha256sum -c) and lets TC fetches pin a hash per asset.
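
For hosts without sha256sum, the same check is short to write by hand. A minimal sketch assuming SHA256SUMS uses the standard sha256sum line layout (hex digest, then two spaces or a space and an asterisk, then the file name); the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_sha256sums(text: str) -> dict:
    """Map file name -> lowercase hex digest from sha256sum-style lines."""
    digests = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, _, name = line.partition(" ")
        # sha256sum writes "<hex>  <name>" (text mode) or "<hex> *<name>" (binary mode).
        digests[name.lstrip(" *")] = digest.lower()
    return digests


def verify_asset(path: Path, sums_text: str) -> bool:
    """True when the file's SHA-256 matches its entry in SHA256SUMS."""
    expected = parse_sha256sums(sums_text).get(path.name)
    if expected is None:
        raise KeyError(f"{path.name} not listed in SHA256SUMS")
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual == expected
```

The digest that passes this check is also the value a task definition would pin for that asset.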
Releases are built and published automatically by
.github/workflows/release.yml on any
v* tag push. The latest release is at
releases/latest.
For local development builds outside the release pipeline, use ./build
as documented above.
A Taskcluster task can fetch and run the collector directly from a release. Sketch for an Android worker (the motivating case — bitbar phones where Mozilla does not own the host OS layer):
payload:
  maxRunTime: 600
  mounts:
    - file: fleetbench
      content:
        url: https://github.com/<owner>/fleetbench/releases/download/v0.2.0/fleetbench-v0.2.0-android-aarch64
        sha256: "<pinned-hash-from-SHA256SUMS>"
  command:
    - - /bin/sh
      - -c
      - "chmod 755 fleetbench && ./fleetbench cpu --mode quick --json > result.json"
  artifacts:
    - name: public/result.json
      type: file
      path: result.json

The same pattern applies on Linux and Windows TC workers; just swap the
release asset URL for the matching platform. A downstream controller tool
(see "Alternative: Taskcluster jobs" above) would enqueue these tasks,
collect the public/result.json artifacts, and drop them into the same
flat results/ layout the runner uses.
Tasks live in .beads/ via beads_rust;
see AGENTS.md for workflow conventions.