Skip to content

ci: add C++ ASan+TSan gate on every PR; document test strategy#539

Merged
jbachorik merged 38 commits into
mainfrom
jb/ci-sanitizer-split
May 26, 2026
Merged

ci: add C++ ASan+TSan gate on every PR; document test strategy#539
jbachorik merged 38 commits into
mainfrom
jb/ci-sanitizer-split

Conversation

@jbachorik
Copy link
Copy Markdown
Collaborator

@jbachorik jbachorik commented May 24, 2026

What does this PR do?:
Adds a native-sanitizer-tests CI job that runs the C++ gtest suite under ASan and TSan on every PR (amd64 + aarch64). Adjusts the nightly to skip C++ gtests since they now run earlier. Adds a comprehensive testing guide documenting all four test tiers.

Motivation:
The crashes fixed over the last two weeks (StringDictionary rb_tree race, RefCountGuard dying-slot reentrance, null calltrace buffer, isReadableRange overflow) are data races in C++ signal handler paths. TSan on the C++ unit tests would have caught them immediately — but TSan was only run on Java functional tests where the JVM generates an unmanageable volume of false positives, making it useless there.

The fix is to separate the two concerns: run C++ gtests (no JVM) under both ASan and TSan on every PR, and keep Java functional tests under ASan in the nightly where they belong.

Additional Notes:

  • native-sanitizer-tests is intentionally independent of the test:asan/test:tsan PR labels — those remain for optional Java functional test coverage, but C++ sanitizer coverage is now unconditional.
  • The nightly gains a skip_gtest: true input so it no longer duplicates C++ gtest runs.
  • doc/build/TestingGuide.md was proof-checked against the actual source — six hallucinations corrected before merge (non-existent test file, wrong Docker flags, wrong test class names, wrong trigger description, non-existent version.txt).

How to test the change?:

  • CI on this PR will run the new native-sanitizer-tests job — both gtestAsan and gtestTsan must pass on amd64 and aarch64.
  • The existing test-matrix (debug) continues to run as before.
  • The nightly can be triggered manually (workflow_dispatch) to verify skip_gtest takes effect.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@jbachorik jbachorik added the AI label May 24, 2026
@jbachorik jbachorik force-pushed the jb/ci-sanitizer-split branch from 9bf4f59 to 651f647 Compare May 24, 2026 10:02
@datadog-prod-us1-3
Copy link
Copy Markdown

datadog-prod-us1-3 Bot commented May 24, 2026

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 4 Pipeline jobs failed

DataDog/java-profiler | gtest-asan-arm64   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). Memory leak detected in Arguments::parse at /go/src/github.com/DataDog/java-profiler/ddprof-lib/src/main/cpp/arguments.cpp:119 and RemoteArgsTest_RemoteSymbolicationNoVariant_Test at /go/src/github.com/DataDog/java-profiler/ddprof-lib/src/test/cpp/remoteargs_ut.cpp:118.

DataDog/java-profiler | gtest-tsan-amd64   View in Datadog   GitLab

🔄 Retry job. This looks flaky and may succeed on retry. Container failed to initialize due to unready status: [build helper docker]

DataDog/java-profiler | gtest-asan-amd64   View in Datadog   GitLab

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. ContainersNotInitialized: containers with incomplete status: [emissary init-permissions]

View all 4 failed jobs.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: fe0d85d | Docs | Datadog PR Page | Give us feedback!

@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 24, 2026

CI Test Results

Run: #26458303058 | Commit: 1850235 | Duration: 14m 29s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-05-26 15:53:09 UTC

jbachorik and others added 4 commits May 24, 2026 16:15
- Use clang's own ASan runtime (libclang_rt.asan) instead of GCC's
  libasan when the compiler is clang; mixing the two caused
  "incompatible ASan runtimes" at startup
- Add rpath pointing to the clang runtime dir so the binary finds it
  at load time
- Strip explicit -lasan/-lubsan from linker args when clang provides
  the runtime implicitly via -fsanitize=address
- Enumerate Architecture exhaustively in locateLibasan so new arches
  don't silently fall through
- Drop LD_PRELOAD from gtest Exec task environment: gtest binaries
  already link the sanitizer runtime; preloading it again causes the
  same "incompatible runtimes" crash

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Two patterns in ddprof_ut crash TSan before it can write any report:

- installGtestCrashHandler installs SIGSEGV/SIGBUS/SIGABRT handlers
  that override TSan's own signal interception. No-op under
  __SANITIZE_THREAD__ so TSan keeps control of crash reporting.
- fork() is unsupported by TSan: the child inherits shadow memory in
  an inconsistent state and segfaults immediately. Guard the
  CriticalSectionExitsEvenAfterTLSCleared test with
  #ifndef __SANITIZE_THREAD__.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
GitHub Actions ubuntu-latest runners have vm.mmap_rnd_bits=32 and their
seccomp profile blocks personality(ADDR_NO_RANDOMIZE), which prevents
TSan's re-exec fallback. sysctl requires privileges that are not
available on GHA runners.

The Datadog internal GitLab runners are managed infrastructure with
kernel settings already tuned for stability, making them the right
home for native sanitizer tests.

Add .gitlab/sanitizer-tests/.gitlab-ci.yml with four jobs running on
every branch push (same trigger as dd-trace integration tests):
  gtest-asan-amd64, gtest-tsan-amd64, gtest-asan-arm64, gtest-tsan-arm64

The nightly GitHub Actions run gains skip_gtest: true so it focuses
on Java functional tests under ASan rather than duplicating the C++
gtest coverage now provided by GitLab.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
TestingGuide.md documents the four test tiers — C++ gtests
(ASan+TSan via GitLab), Java functional tests (ASan nightly), dd-trace
integration (GitLab every push), and chaos/reliability (GitLab
scheduled) — covering what each tier catches, local run commands, and
why the split exists.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the jb/ci-sanitizer-split branch from b35af36 to 56186c9 Compare May 24, 2026 14:15
jbachorik and others added 21 commits May 24, 2026 16:16
The sanitizer gtest jobs compile and run C++ unit tests from source.
They need no artifacts from prepare:start or get-versions — just the
checkout and the build image.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Tier 1 now points to GitLab (.gitlab/sanitizer-tests) and the build stage
- Explain why GitLab rather than GitHub Actions (mmap_rnd_bits + seccomp)
- Fix prerequisites: drop libasan6/libtsan0 (Ubuntu 22.04-specific package
  names; runtimes now bundled with compiler)
- Update TSan failure note: report goes to stderr in GitLab job log

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
build-artifact is the pivot that all deploy/integration/reliability/
benchmark jobs depend on via needs. Adding the four gtest-*san-* jobs
to its needs list ensures a sanitizer failure blocks the entire
downstream pipeline while the native builds and sanitizer tests still
run in parallel within the build stage.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Add a sanitizer stage between prepare and build. The four gtest-*san-*
jobs run there with no explicit needs, so they start after the prepare
stage completes by normal stage ordering.

build:x64 / build:arm64 bypass stage ordering via their needs: list
so they still run in parallel with the sanitizer tests. build-artifact
(which everything downstream depends on) waits for both the native
builds AND the sanitizer jobs via its needs: list, ensuring a sanitizer
failure blocks the entire downstream pipeline.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sanitizer jobs need only the source checkout — no version.txt, no
build.env, nothing from prepare:start. needs: [] lets them fire at
pipeline start instead of waiting for the prepare stage.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Extend .cache-config-pull so Gradle wrapper is cached instead of
  downloading from services.gradle.org (unreachable from runners)
- Run sysctl vm.mmap_rnd_bits=28 before TSan; ignore failure if
  the runner doesn't permit it (best-effort; runners may already
  have correct settings)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
ASan and UBSan reports were written to /tmp/asan_%p.log and
/tmp/ubsan_%p.log respectively, making them invisible in CI logs.
Without log_path both sanitizers write to stderr, which flows through
Gradle's Exec task output and appears directly in the job log.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
.cache-config-pull is read-only — on a cold cache the job would try
to download Gradle from services.gradle.org (unreachable from runners).
.cache-config uses push+pull: first run downloads and caches, all
subsequent runs hit the cache.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
GtestTaskBuilder: set standardOutput/errorOutput to System.out/System.err
so sanitizer reports (ASan/TSan/UBSan) appear directly in the CI log
instead of being swallowed at Gradle INFO level.

gradle-wrapper.properties: increase networkTimeout to 30s and retries to
5 with 2s back-off so transient connection resets don't abort the download
on the first attempt.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
System.out/System.err captured at Gradle configuration time go through
Gradle's console wrapper, which swallows child process output unless
--info is passed. Writing to FileOutputStream("/dev/stdout|stderr")
bypasses that entirely and ensures sanitizer reports appear in the CI log.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The default Exec task buffers child output in a ByteArrayOutputStream
and discards it on task failure — that is why sanitizer output never
reached the CI log even when Gradle theoretically captured it.

/dev/stdout and /dev/stderr stream bytes directly to fd 1/2 of the
Gradle JVM as they arrive. Explicit flush in doLast ensures the OS
buffer is drained before Gradle tears down the task.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The runners are Kubernetes pods. vm.mmap_rnd_bits is a non-namespaced
kernel parameter — sysctl silently did nothing inside the pod. TSan
crashed with SIGSEGV before writing any output.

setarch -R calls personality(ADDR_NO_RANDOMIZE) before execing Gradle.
ADDR_NO_RANDOMIZE is inherited across fork/exec so all children,
including the TSan binary, run with ASLR disabled. This works as long
as the Kubernetes seccomp profile allows personality(2), which is less
commonly blocked than sysctl.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When the Gradle daemon's fd 1 is not connected to the terminal (e.g.
Kubernetes CI), sanitizer output from test binaries is invisible.
buildGtest{Config} lets CI build the binaries via Gradle and then
execute them directly from the shell where stdout is guaranteed.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The Gradle daemon's fd 1/2 are pipes to the wrapper process, not the
terminal. Even FileOutputStream("/dev/stdout") in the daemon writes to
those pipes which the wrapper logs at INFO level (invisible in CI).

New approach:
  1. ./gradlew :ddprof-lib:buildGtest{Asan,Tsan} — compile + link only
  2. Shell loop runs each binary directly — stdout/stderr go straight
     to the CI log with no Gradle indirection

setarch -R on both steps keeps ASLR disabled for the TSan binary.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The build image sets GRADLE_USER_HOME=/gradle-cache but .cache-config
saves .gradle/. The paths don't match so the Gradle distribution is
re-downloaded on every run. Override GRADLE_USER_HOME to .gradle so
both the save and restore operate on the same directory.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
setarch -R is only needed when running the TSan/ASan binary itself.
Gradle is plain Java and doesn't need ASLR disabled for compilation
and linking.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Each of the 24 test binaries independently compiles all 60 library
sources. --parallel lets the compile tasks run concurrently.
--build-cache reuses compiled objects from the GitLab cache on
subsequent runs so only changed files are recompiled.

Long-term fix: compile library sources once via a shared static library
and link each test binary against it.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Each test binary previously recompiled all 59 library sources independently:
26 test files × 60 sources = 1,560 compilation units per config.

Add compileGtestLibrary{Config} which compiles the 59 library sources once.
Each per-test compile task now compiles only its own test file (1 source).
Link tasks pull in both the shared lib objects and the test-specific object.

Result: 1 + 27 = 28 compile tasks per config (was 27 × 60 = 1,620).
Cold build time drops from ~25 min to ~2 min on a 4-CPU Kubernetes pod.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Expose skipConditions() publicly so GtestPlugin can reuse it in the
  shared lib task onlyIf instead of duplicating the three property checks
- Close /dev/stdout and /dev/stderr FileOutputStreams in doLast to avoid
  leaked file descriptors (flush alone was insufficient)
- Remove redundant 'Include shared library objects' comment (code speaks)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
TSan jobs are kept for coverage but marked allow_failure: true — they
provide signal when the environment allows TSan to run, and don't block
the pipeline when the kernel vDSO mapping conflicts with TSan shadow.
TSan jobs are optional in build-artifact needs: so the artifact builds
even when TSan can't initialize.

ASan jobs: drop setarch -R (not needed for ASan) and install llvm so
llvm-symbolizer is available for readable stack traces.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The Kubernetes pod's kernel vDSO is mapped at 0x002000000000 which
conflicts with TSan shadow. vm.mmap_rnd_bits is non-namespaced so it
cannot be set from an unprivileged container.

docker --privileged has CAP_SYS_ADMIN and CAN write non-namespaced
kernel parameters. Running the entire TSan build+test inside a
privileged Docker container (using the project's DinD infrastructure)
sets vm.mmap_rnd_bits=28 on the host kernel, resolving the mapping
conflict. TSan jobs remain allow_failure while this is validated.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
jbachorik and others added 10 commits May 24, 2026 17:52
docker --privileged is also unavailable on shared runners (no Docker
socket). All userspace approaches to set vm.mmap_rnd_bits are blocked.

TSan jobs keep allow_failure:true, run with setarch -R as a best-effort
attempt. They will pass when infrastructure is fixed — either:
  - Runner nodes get vm.mmap_rnd_bits=28 via a DaemonSet
  - personality() is allowed in the pod seccomp profile

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
From the infra wiki:
- docker-in-docker:amd64 = Kubernetes with Docker socket → use
  docker run --privileged to set vm.mmap_rnd_bits=28
- docker-in-docker:arm64 = EC2 VM → sysctl may work directly

Previously used arch:amd64 which is Kubernetes without Docker socket.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
gtest's default "fast" death test style uses plain fork(). The child
inherits TSan's internal state which is inconsistent after fork(),
causing an immediate SIGSEGV with no report.

"threadsafe" style uses fork()+exec() instead — the exec'd child
starts with a clean TSan instance and works correctly.

Only set in TSan job scripts, not in the shared ASan template.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Instead of wrapping the entire build+test in docker run --privileged
(which truncates output through Docker's pipe), use a throwaway Alpine
container just to set vm.mmap_rnd_bits=28 on the host kernel. Build
and run the TSan binaries directly in the outer shell using the build
image — same pattern as the arm64 EC2 job.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Alpine uses musl — wrong libc for anything in this pipeline. Use the
standard glibc build image which already has sysctl and is cached.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
docker-in-docker:amd64 runs on Kata Containers (kata-qemu runtime).
Each job gets its own isolated VM kernel — sysctl -w vm.mmap_rnd_bits=28
only affects that job's VM, not the shared Kubernetes node.

No docker run --privileged wrapper needed; the runner already has
privileged=true inside the micro VM. Identical pattern to arm64 EC2.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
amd64: revert to docker run --privileged for the sysctl (outer Kata
container lacks CAP_SYS_ADMIN for non-namespaced writes); pipe through
cat to force streaming and prevent output truncation

arm64: try kernel.randomize_va_space=0 (full ASLR disable) in addition
to vm.mmap_rnd_bits=28; add /proc/self/maps grep to identify what is
at 0x002000000000 so we know what is conflicting with TSan shadow

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
TSan crashes with SIGSEGV before printing anything. Likely LLVM 11 in
the build image is incompatible with the Kata micro VM kernel.
Probe points: clang version, kernel version, vm.mmap_rnd_bits, and a
verbosity=1 --gtest_list_tests run to see what TSan says on init.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
LLVM 11 (2020) TSan crashes on kernel 6.8 (Kata micro VM) during shadow
memory initialization. Install LLVM 18 from apt.llvm.org and compile
TSan binaries with clang++-18 via -Pnative.forceCompiler=clang++-18.

Long-term fix: update BUILD_IMAGE_X64 to include clang-18.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
arm64: remove kernel.randomize_va_space=0 — with ASLR fully off,
ld-linux-aarch64.so loads at its fixed default 0x002000000000 which is
exactly TSan's 39-bit shadow start. vm.mmap_rnd_bits=28 alone gives
TSan's LLVM re-exec a good chance of finding a clean layout.

amd64: drop LLVM 18 installation hack (Kata fixed mappings conflict with
any LLVM version). Revert to simple direct execution; documented why
Kata is fundamentally incompatible with TSan.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
jbachorik added a commit that referenced this pull request May 25, 2026
Each gtest test file was being compiled alongside all 59 library sources,
so the per-config build cost was O(n_tests * n_sources). On slow CI
runners this turned a debug gtest run into a multi-minute rebuild even
when only one test file changed.

Register a single compileGtestLibrary<Config> task that compiles the
library sources once per build configuration. Each test's compile task
now compiles only its own .cpp; each link task pulls in both its own
objects and the shared library objects via the new
GtestTaskBuilder.withSharedLibObjects() hook.

Also adds a buildGtest<Config> aggregation task that runs compile + link
without execution, for CI flows that invoke binaries directly.

Ported from PR #539 (cherry-picked the build-cache portion only;
sanitizer-runtime fixes from that PR are out of scope here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbachorik
Copy link
Copy Markdown
Collaborator Author

@copilot resolve the merge conflicts in this pull request

Copy link
Copy Markdown
Contributor

Copilot AI commented May 26, 2026

@copilot resolve the merge conflicts in this pull request

Conflicts are resolved in commit 46394bb (merged origin/main and fixed the gtest build-logic conflicts).

@jbachorik jbachorik marked this pull request as ready for review May 26, 2026 15:34
@jbachorik jbachorik requested a review from a team as a code owner May 26, 2026 15:34
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe0d85ddac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread .github/workflows/nightly.yml
Comment thread .gitlab/sanitizer-tests/.gitlab-ci.yml
@jbachorik jbachorik merged commit 57d301f into main May 26, 2026
106 of 119 checks passed
@jbachorik jbachorik deleted the jb/ci-sanitizer-split branch May 26, 2026 16:10
@github-actions github-actions Bot added this to the 1.43.0 milestone May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants