feat: Elastic CR e2e helper layer for sds-elastic suites (retire raw-Rook helpers)#25

Open
AleksZimin wants to merge 4 commits into main from add-elastic

@AleksZimin AleksZimin commented Jun 15, 2026

Member

Summary

Adds a high-level Elastic CR helper layer to storage-e2e so module suites
(starting with sds-elastic) drive the module through its public
storage.deckhouse.io/v1alpha1 API (ElasticCluster / ElasticStorageClass)
instead of poking the vendored Rook and csi-ceph CRs directly. The old
raw-Rook/Ceph helper set is retired in the same change (net −787 lines).

It also carries the two supporting pieces needed to actually run those suites
on freshly bootstrapped clusters: a per-module image-tag override and a
retrying NodeGroup create.

What's included

Elastic CR helper layer (new)

  • pkg/kubernetes/elasticcluster.go, pkg/kubernetes/elasticstorageclass.go:
    create/wait/get/delete helpers for ElasticCluster and ElasticStorageClass,
    including readiness polling on their conditions.
  • pkg/testkit/elastic.go — testkit-level builders/constants (replication
    modes, ESC types) that suites compose against.
  • pkg/kubernetes/elasticrook.go — read-only helpers to inspect the vendored
    Rook CRs (renamed internal.sdselastic.deckhouse.io group) for assertions.
  • pkg/kubernetes/blockdevice.go — label/select raw BlockDevices that back OSDs.
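
The readiness polling mentioned above can be sketched roughly as follows. This is not the code from the PR: it uses plain nested maps in place of unstructured objects so it runs with the standard library only, and the condition type "Ready", the function names, and the attempt-based polling loop are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// conditionStatus looks up status.conditions[type == condType].status in an
// unstructured-style object. The boolean is false when the condition is absent
// or malformed.
func conditionStatus(obj map[string]any, condType string) (string, bool) {
	status, ok := obj["status"].(map[string]any)
	if !ok {
		return "", false
	}
	conds, ok := status["conditions"].([]any)
	if !ok {
		return "", false
	}
	for _, c := range conds {
		cond, ok := c.(map[string]any)
		if !ok {
			continue
		}
		if cond["type"] == condType {
			s, ok := cond["status"].(string)
			return s, ok
		}
	}
	return "", false
}

// isReady reports whether the (assumed) "Ready" condition has status "True".
func isReady(obj map[string]any) bool {
	s, ok := conditionStatus(obj, "Ready")
	return ok && s == "True"
}

// waitReady re-reads the object via get up to attempts times, sleeping
// interval between reads, and returns nil as soon as the object is ready.
func waitReady(get func() (map[string]any, error), attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		obj, err := get()
		if err == nil && isReady(obj) {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("object not Ready after %d attempts", attempts)
}

func main() {
	// Hypothetical object shape; a real helper would fetch it with a dynamic client.
	obj := map[string]any{
		"kind": "ElasticCluster",
		"status": map[string]any{
			"conditions": []any{
				map[string]any{"type": "Ready", "status": "True"},
			},
		},
	}
	err := waitReady(func() (map[string]any, error) { return obj, nil }, 3, 10*time.Millisecond)
	fmt.Println("ready:", isReady(obj), "wait error:", err)
}
```

Treating a missing or malformed condition as "not ready yet" keeps the poll loop simple: a freshly created CR with an empty status is retried rather than reported as an error.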

Raw-Rook helpers removed

  • Deleted cephcluster.go, cephblockpool.go, cephclusterconnection.go,
    cephstorageclass.go, cephcredentials.go and pkg/testkit/{ceph,ceph_cluster}.go.
    Suites no longer manipulate Rook/csi-ceph CRs by hand; they go through the
    module's own CRs.

Per-module modulePullOverride env override

  • internal/config/overrides.go (+ tests): override a module's
    modulePullOverride at config load via a per-module env var
    (sds-elastic -> SDS_ELASTIC_MODULE_PULL_OVERRIDE, csi-ceph ->
    CSI_CEPH_MODULE_PULL_OVERRIDE; name upper-cased, non-[A-Z0-9] -> _).
    Lets CI pin the module-under-test to a pr<N>/mr<N> image without editing
    the committed cluster_config.yml. A single global tag is intentionally
    avoided so multi-module configs stay unambiguous; each applied override is
    logged at load time.
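
A minimal sketch of the naming rule and the env-wins precedence described above. It is an illustration only, not the ApplyModulePullOverrideEnv implementation: the moduleConfig type, the function names, the log wording, and the choice to treat an empty env value as unset are all assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// overrideEnvName upper-cases the module name, replaces every character
// outside [A-Z0-9] with '_', and appends the _MODULE_PULL_OVERRIDE suffix.
func overrideEnvName(module string) string {
	var b strings.Builder
	for _, r := range strings.ToUpper(module) {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('_')
		}
	}
	return b.String() + "_MODULE_PULL_OVERRIDE"
}

// moduleConfig is a hypothetical stand-in for one module entry in cluster_config.yml.
type moduleConfig struct {
	Name               string
	ModulePullOverride string
}

// applyOverrides replaces each module's static tag with the value of its
// per-module env var when one is set, and returns one log line per override.
// Assumption: an empty env value is treated the same as an unset variable.
func applyOverrides(modules []moduleConfig, lookup func(string) (string, bool)) []string {
	var logs []string
	for i := range modules {
		env := overrideEnvName(modules[i].Name)
		tag, ok := lookup(env)
		if !ok || tag == "" {
			continue
		}
		logs = append(logs, fmt.Sprintf("module %s: %s=%q overrides static tag %q",
			modules[i].Name, env, tag, modules[i].ModulePullOverride))
		modules[i].ModulePullOverride = tag
	}
	return logs
}

func main() {
	modules := []moduleConfig{
		{Name: "sds-elastic", ModulePullOverride: "main"},
		{Name: "csi-ceph", ModulePullOverride: "main"},
	}
	for _, line := range applyOverrides(modules, os.LookupEnv) {
		fmt.Println(line)
	}
	fmt.Printf("%+v\n", modules)
}
```

Passing the lookup function as a parameter (os.LookupEnv in main) keeps the precedence logic testable without mutating the process environment.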

NodeGroup create robustness

  • pkg/kubernetes/nodegroup.go: wrap CreateStaticNodeGroup in a bounded
    retry (idempotent re-read each attempt). Right after dhctl bootstrap the
    node-manager validating webhook is often still unreachable, so a single-shot
    create raced it and failed the whole run with a transient
    failed calling webhook ... connect: operation not permitted / InternalError.
  • internal/config/config.go: NodeGroupTimeout 2m → 4m (now a retry budget),
    SecretsWaitTimeout 2m → 10m (bootstrap secrets/webhook convergence routinely
    exceed 2m on slower/nested clusters).
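
The bounded, idempotent retry described above could look roughly like the sketch below. It uses only the standard library rather than the repository's retry.DoVoid helper, and the function names, the doubling backoff, and the simulated webhook error are assumptions for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errWebhookDown simulates the transient "failed calling webhook" error.
var errWebhookDown = errors.New("failed calling webhook: connect: operation not permitted")

// ensureWithRetry re-reads the resource on every attempt and creates it only
// if it is missing, so a create that succeeded server-side but returned an
// error to the client is not repeated. The budget bounds the whole loop.
func ensureWithRetry(ctx context.Context, budget, backoff time.Duration,
	exists func(context.Context) (bool, error),
	create func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()
	for {
		found, err := exists(ctx)
		if err == nil && found {
			return nil
		}
		if err == nil {
			if err = create(ctx); err == nil {
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("retry budget exhausted: %w (last error: %v)", ctx.Err(), err)
		case <-time.After(backoff):
		}
		backoff *= 2 // assumed exponential backoff between attempts
	}
}

// simulate runs ensureWithRetry against a fake create that fails the first
// `failures` times, and returns how many create calls were made.
func simulate(failures int) (int, error) {
	attempts := 0
	created := false
	exists := func(context.Context) (bool, error) { return created, nil }
	create := func(context.Context) error {
		attempts++
		if attempts <= failures {
			return errWebhookDown
		}
		created = true
		return nil
	}
	err := ensureWithRetry(context.Background(), 2*time.Second, time.Millisecond, exists, create)
	return attempts, err
}

func main() {
	attempts, err := simulate(2)
	fmt.Println("create attempts:", attempts, "error:", err)
}
```

In this shape the timeout acts as a total retry budget rather than a per-call deadline, which matches how the description characterizes NodeGroupTimeout after the change.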

Docs

  • docs/{ARCHITECTURE,FUNCTIONS_GLOSSARY,WORKLOG}.md and README.md updated to
    reflect the new helper layer, the <MODULE>_MODULE_PULL_OVERRIDE contract,
    and the timeout/retry behavior.

Compatibility / notes

  • Breaking for any consumer that imported the removed raw-Rook/Ceph helpers; the
    intended replacement is the Elastic CR layer above.
  • The first consumer (sds-elastic e2e suite) pins this branch as a
    pseudo-version until it is merged and tagged.

Test plan

  • go build ./..., go vet ./... clean
  • go test ./internal/config/... (override normalization / precedence) passes
  • Downstream sds-elastic e2e suite (nested cluster: EC bootstrap, RBD/CephFS/HighRedundancy ESC round-trip, deletion guards, module-disable guard) runs green against this branch
  • Reviewer sanity-check on the removed-helper surface (no remaining in-repo importers)

Introduce the sds-elastic-layer helpers the e2e suite builds on:

- pkg/kubernetes/elasticcluster.go and elasticstorageclass.go: low-level
  create/wait/delete + condition/topology readers over the cluster-scoped
  storage.deckhouse.io/v1alpha1 EC/ESC CRs (unstructured + dynamic client).
- pkg/kubernetes/elasticrook.go: readiness verifiers for the renamed Rook
  group internal.sdselastic.deckhouse.io/v1 plus a discovery helper to assert
  the upstream ceph.rook.io group is absent.
- pkg/kubernetes/blockdevice.go: LabelBlockDevice to mark disks for OSD
  adoption via an ElasticCluster blockDeviceSelector.
- pkg/testkit/elastic.go: high-level EnsureElasticCluster/EnsureElasticStorageClass,
  Teardown* (with optional force-deletion), and EnsureElasticOSDBlockDevices.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>

The csi-ceph-era raw-Rook builders (Ceph cluster/pool/storageclass/
connection/auth/credentials) are superseded by the dynamic-client
ElasticCluster/ElasticStorageClass helpers, which drive the same Ceph
backend through the sds-elastic CRDs. Drop the orphaned builders and
their testkit wrappers; relocate DefaultRookNamespace to the kept
rookconfigoverride.go (still used by ceph_crc.go) and refresh the
functions glossary. The CephFilesystem/CRC/config-override daemon
helpers are retained for the Rook-daemon stress paths.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>

Add ApplyModulePullOverrideEnv so each module's modulePullOverride can be
overridden at config load via a per-module env var (sds-elastic ->
SDS_ELASTIC_MODULE_PULL_OVERRIDE, csi-ceph -> CSI_CEPH_MODULE_PULL_OVERRIDE;
name upper-cased, non-[A-Z0-9] -> _). This lets CI pin the module-under-test
to a pr<N>/mr<N> image without editing the committed cluster_config.yml,
while the YAML stays literal and ${VAR} inside modulePullOverride remains
rejected. A single global tag is intentionally avoided so multi-module
configs are unambiguous. Each applied override is logged at load time,
naming both the static tag and the env-var tag that wins.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>

Right after dhctl bootstrap the node-manager validating webhook is often
still unreachable, so CreateStaticNodeGroup raced it and failed the whole
run with a transient "failed calling webhook ... connect: operation not
permitted" / InternalError. Wrap the existence-check + create in
retry.DoVoid with backoff (bounded by NodeGroupTimeout) and keep it
idempotent by re-reading the NodeGroup each attempt.

Also bump NodeGroupTimeout 2m -> 4m (now a retry budget) and
SecretsWaitTimeout 2m -> 10m, since secret materialization and webhook
convergence routinely exceed 2m on slower/nested clusters.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@AleksZimin AleksZimin self-assigned this Jun 15, 2026
@AleksZimin AleksZimin changed the title from "Add elastic" to "feat: Elastic CR e2e helper layer for sds-elastic suites (retire raw-Rook helpers)" Jun 15, 2026
@AleksZimin AleksZimin marked this pull request as ready for review June 15, 2026 23:04