Skip to content

hot restart: propagate programmatic stat tags across restart#45674

Open
bpalermo wants to merge 3 commits into
envoyproxy:mainfrom
bpalermo:hot-restart-propagate-stat-tags
Open

hot restart: propagate programmatic stat tags across restart#45674
bpalermo wants to merge 3 commits into
envoyproxy:mainfrom
bpalermo:hot-restart-propagate-stat-tags

Conversation

@bpalermo

@bpalermo bpalermo commented Jun 16, 2026

Copy link
Copy Markdown

Commit Message: hot restart: propagate programmatic stat tags across restart

Additional Description:

Stats created with programmatic tags (via Scope::*FromStatNameWithTags / the tag-aware *FromTaggedName API) embed their tag values into the flat stat name while keeping a clean tagExtractedName and a set of tags. This works on a fresh process, but breaks across a hot restart:

  • The parent serializes each counter/gauge to the child as name string → value only (HotRestartingParent::Internal::exportStatsToChild); tags() / tagExtractedName() are dropped.
  • The child's StatMerger::mergeCounters re-creates the stat from the name alone, which runs only regex tag extraction. With no matching stats_tags regex, the merged stat gets empty tags and tagExtractedName == the full mangled name.
  • That no-tags stat is created before the child's own code first touches the stat, so it wins the central-cache slot. The later tagged creation returns the already-mangled stat (stats are immutable post-creation). The tag values end up baked into the metric name with empty labels (e.g. as exported to Prometheus).

This is the gap behind the long-standing // TODO(snowp): Propagate tag values during hot restarts in stat_merger.cc. Regex-extracted tags survive only because extraction re-derives them from the name; programmatic tags have nothing to re-derive from.

This change actually propagates the tags:

  • hot_restart.proto: new TaggedMetric (tag-extracted name + repeated Tag) and counter_tags / gauge_tags maps on Reply.Stats.
  • The parent emits an entry only for tags the child could not re-derive from the name: it re-runs the store's TagProducer on the flat name and skips the stat when extraction reproduces the stored tags (the common case — including all default tag regexes). The stats transfer payload is therefore unchanged except for stats that actually carry programmatic tags. A nullable Store::tagProducer() accessor (default nullptr; only ThreadLocalStoreImpl overrides) supports the comparison.
  • The child decodes the metadata and StatMerger re-creates the stat with the original tags via new Scope::counterFromMergedStatName / gaugeFromMergedStatName. Following stats: new tags friendly scope based API #45146 these have non-virtual-breaking defaults that delegate to the tag-aware counterFromTaggedName / gaugeFromTaggedName (which retain the metadata on tag-aware scopes and IsolatedStoreImpl); the legacy ThreadLocalStore ScopeImpl — which intentionally drops tag metadata on the *FromTaggedName path because it cannot compose tag-extracted names with its prefix in general — overrides them to honor the components, which arrive fully resolved. The full name is still reconstructed via DynamicContext (dynamic spans preserved), so the cache key matches the stat the child independently creates — the fix removes the first-write-wins poisoning rather than working around it.

Risk Level: Medium — changes stats labels after a hot restart for tag-bearing stats. Runtime guarded.

Testing: New tests: stat_merger_test (ProgrammaticTagsSurviveMerge, ProgrammaticGaugeTagsSurviveMerge, ProgrammaticTagsLostWithoutMetadata pinning the pre-fix symptom), hot_restarting_parent_test (ExportsTagMetadataOnlyWhenNotRederivable, TagMetadataAppliedToMergedStats end-to-end, TagMetadataIgnoredWhenRuntimeFlagDisabled), and thread_local_store_test (MergedStatNameHonorsSuppliedTags on both the legacy and tag-aware scopes). Built/run via RBE.

Docs Changes: N/A

Release Notes: Added (changelogs/current/bug_fixes/stats__hot-restart-propagate-programmatic-stat-tags.rst).

Runtime guard: envoy.reloadable_features.hot_restart_propagate_stat_tags (revert to legacy name-derived behavior).

@bpalermo bpalermo had a problem deploying to external-contributors June 16, 2026 23:12 — with GitHub Actions Error
@repokitteh-read-only

Copy link
Copy Markdown

Hi @bpalermo, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #45674 was opened by bpalermo.

see: more, trace.

@repokitteh-read-only

Copy link
Copy Markdown

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #45674 was opened by bpalermo.

see: more, trace.

Stats created via *FromStatNameWithTags inline their tag values into the
stat name. The hot restart stat merge re-created such stats from the name
alone with no tags, so after a restart the tag values collapsed into the
metric name and the labels went empty (e.g. as exported to Prometheus).
This resolves the long-standing "TODO(snowp): Propagate tag values during
hot restarts".

The parent now transmits the tag-extracted name and tag values for stats
that carry programmatic tags; the child re-creates them with identical
labels via new Scope::{counter,gauge}FromMergedStatName methods (default
implementations fall back to the prior name-derived behavior, so other
Scope implementations are unaffected). Guarded by
envoy.reloadable_features.hot_restart_propagate_stat_tags.

Signed-off-by: Bruno Palermo <b@palermo.dev>
@bpalermo bpalermo force-pushed the hot-restart-propagate-stat-tags branch from 0bdc37d to 481e3ea Compare June 17, 2026 10:19
@bpalermo bpalermo temporarily deployed to external-contributors June 17, 2026 10:19 — with GitHub Actions Inactive
@bpalermo bpalermo marked this pull request as ready for review June 17, 2026 11:16
@repokitteh-read-only

Copy link
Copy Markdown

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

🐱

Caused by: #45674 was ready_for_review by bpalermo.

see: more, trace.

@botengyao

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a bug where stats created with programmatic tags lost their labels across hot restarts. It introduces mechanisms to transmit tag-extracted names and tag values from the parent process to the child, allowing the child to correctly re-create the stats with identical labels. The review feedback focuses on minor performance and readability optimizations, including reserving space in maps and protobuf repeated fields to prevent unnecessary reallocations, and simplifying absl::Span construction by utilizing implicit conversions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread source/server/hot_restarting_child.cc
Comment thread source/server/hot_restarting_parent.cc
Comment thread source/common/stats/thread_local_store.cc Outdated
Comment thread source/common/stats/thread_local_store.cc Outdated
@paul-r-gall

Copy link
Copy Markdown
Contributor

please fix format and address ai-gen comments.

@RyanTheOptimist

Copy link
Copy Markdown
Contributor

/wait

bpalermo added 2 commits July 4, 2026 13:48
Resolves conflicts with the new tag-aware scope API (envoyproxy#45146):
counterFromMergedStatName/gaugeFromMergedStatName now default to the
tag-aware counterFromTaggedName/gaugeFromTaggedName (which retain tag
metadata on tag-aware scopes and isolated stores), and the legacy
ThreadLocalStore ScopeImpl overrides them via the tag-aware
TagStatNameJoiner, so the getOrCreate*Base refactor is no longer
needed.

Signed-off-by: Bruno Palermo <b@palermo.dev>
…add tests

- Export tag metadata from the hot restart parent only for stats whose
  tags the child cannot re-derive from the name via tag extraction
  (regex-extracted tags, including the default tag regexes, are
  reproduced by the child during the merge), keeping the stats transfer
  payload unchanged for stats without programmatic tags. Adds a
  nullable Store::tagProducer() accessor to support the comparison.
- Reserve map/repeated-field capacity in the tag conversion paths
  (review feedback).
- Fix spelling flagged by CI (inlines -> embeds).
- New tests: parent-side gating, end-to-end tag propagation through
  export/merge, runtime-flag-off legacy behavior, gauge merge path,
  and direct coverage of the merged-stat creation methods on legacy
  and tag-aware scopes.

Signed-off-by: Bruno Palermo <b@palermo.dev>
@bpalermo bpalermo requested a deployment to external-contributors July 4, 2026 17:50 — with GitHub Actions Waiting
@bpalermo

bpalermo commented Jul 4, 2026

Copy link
Copy Markdown
Author

Updated — changes since the review:

  • Merged main (no force-push): stats: new tags friendly scope based API #45146 (tag-aware scope API) landed in the meantime and refactored exactly this area. The merge adapts the fix to it and makes the diff smaller: counterFromMergedStatName/gaugeFromMergedStatName now default to the tag-aware *FromTaggedName (correct on tag-aware scopes and IsolatedStoreImpl with zero changes there), and only the legacy ThreadLocalStore::ScopeImpl — which intentionally drops tag metadata on that path — overrides them (via the tag-aware TagStatNameJoiner). The previous getOrCreate*Base signature refactor is gone.
  • Payload gating (new): while re-validating I realized the parent was emitting tag metadata for any stat with tags — which, with the default tag regexes, is most stats, bloating the hot-restart stats transfer for everyone. The parent now re-runs the store's TagProducer on the flat name and skips stats whose tags the child re-derives anyway; only true programmatic tags are transmitted. Added a nullable Store::tagProducer() accessor (default nullptr) to support this.
  • AI review comments addressed: the two reserve() suggestions applied; the two absl::Span conversion suggestions no longer apply (that code was removed in the stats: new tags friendly scope based API #45146 adaptation) — replied inline.
  • Format fixed: clang-format (18.1.8) + the two spell-check errors (inlinesembeds).
  • Coverage: added tests for the previously uncovered paths — parent-side gating, end-to-end export→merge tag propagation, runtime-flag-off legacy behavior, the gauge merge path, and the merged-stat creation methods on both legacy and tag-aware scopes. (The earlier tsan/arm64 failures were unrelated flakes: NetworkExtensionDiscoveryIntegrationTest, QUIC keepalive idle timeout.)

cc @paul-r-gall

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants