feat: Elastic CR e2e helper layer for sds-elastic suites (retire raw-Rook helpers) #25
Open
AleksZimin wants to merge 4 commits into
Introduce the sds-elastic-layer helpers the e2e suite builds on:
- pkg/kubernetes/elasticcluster.go and elasticstorageclass.go: low-level create/wait/delete + condition/topology readers over the cluster-scoped storage.deckhouse.io/v1alpha1 EC/ESC CRs (unstructured + dynamic client).
- pkg/kubernetes/elasticrook.go: readiness verifiers for the renamed Rook group internal.sdselastic.deckhouse.io/v1 plus a discovery helper to assert the upstream ceph.rook.io group is absent.
- pkg/kubernetes/blockdevice.go: LabelBlockDevice to mark disks for OSD adoption via an ElasticCluster blockDeviceSelector.
- pkg/testkit/elastic.go: high-level EnsureElasticCluster/EnsureElasticStorageClass, Teardown* (with optional force-deletion), and EnsureElasticOSDBlockDevices.
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
The csi-ceph-era raw-Rook builders (Ceph cluster/pool/storageclass/connection/auth/credentials) are superseded by the dynamic-client ElasticCluster/ElasticStorageClass helpers, which drive the same Ceph backend through the sds-elastic CRDs. Drop the orphaned builders and their testkit wrappers; relocate DefaultRookNamespace to the kept rookconfigoverride.go (still used by ceph_crc.go) and refresh the functions glossary. The CephFilesystem/CRC/config-override daemon helpers are retained for the Rook-daemon stress paths.
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Add ApplyModulePullOverrideEnv so each module's modulePullOverride can be
overridden at config load via a per-module env var (sds-elastic ->
SDS_ELASTIC_MODULE_PULL_OVERRIDE, csi-ceph -> CSI_CEPH_MODULE_PULL_OVERRIDE;
name upper-cased, non-[A-Z0-9] -> _). This lets CI pin the module-under-test
to a pr<N>/mr<N> image without editing the committed cluster_config.yml,
while the YAML stays literal and ${VAR} inside modulePullOverride remains
rejected. A single global tag is intentionally avoided so multi-module
configs are unambiguous. Each applied override is logged at load time,
naming both the static tag and the env-var tag that wins.
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Right after dhctl bootstrap the node-manager validating webhook is often still unreachable, so CreateStaticNodeGroup raced it and failed the whole run with a transient "failed calling webhook ... connect: operation not permitted" / InternalError. Wrap the existence-check + create in retry.DoVoid with backoff (bounded by NodeGroupTimeout) and keep it idempotent by re-reading the NodeGroup each attempt.
Also bump NodeGroupTimeout 2m -> 4m (now a retry budget) and SecretsWaitTimeout 2m -> 10m, since secret materialization and webhook convergence routinely exceed 2m on slower/nested clusters.
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Summary
Adds a high-level Elastic CR helper layer to storage-e2e so module suites
(starting with sds-elastic) drive the module through its public
storage.deckhouse.io/v1alpha1 API (ElasticCluster / ElasticStorageClass)
instead of poking the vendored Rook and csi-ceph CRs directly. The old
raw-Rook/Ceph helper set is retired in the same change (net −787 lines).
It also carries the two supporting pieces needed to actually run those suites
on freshly bootstrapped clusters: a per-module image-tag override and a
retrying NodeGroup create.
What's included
Elastic CR helper layer (new)
- pkg/kubernetes/elasticcluster.go, pkg/kubernetes/elasticstorageclass.go: create/wait/get/delete helpers for ElasticCluster and ElasticStorageClass, including readiness polling on their conditions.
- pkg/testkit/elastic.go: testkit-level builders/constants (replication modes, ESC types) that suites compose against.
- pkg/kubernetes/elasticrook.go: read-only helpers to inspect the vendored Rook CRs (renamed internal.sdselastic.deckhouse.io group) for assertions.
- pkg/kubernetes/blockdevice.go: label/select raw BlockDevices that back OSDs.
Raw-Rook helpers removed
- cephcluster.go, cephblockpool.go, cephclusterconnection.go, cephstorageclass.go, cephcredentials.go and pkg/testkit/{ceph,ceph_cluster}.go. Suites no longer manipulate Rook/csi-ceph CRs by hand; they go through the module's own CRs.
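To illustrate the readiness polling on conditions that the Elastic CR helpers perform, here is a minimal sketch of reading one condition from an unstructured object, i.e. the nested-map shape a dynamic client returns. It uses only the standard library; the function name conditionStatus and the "Ready" condition type are assumptions made for the example, not names taken from this PR.

```go
package main

import "fmt"

// conditionStatus looks up status.conditions[type == condType].status in an
// unstructured object. It returns the status string and whether a condition
// of that type was found. Hypothetical helper, not part of the PR.
func conditionStatus(obj map[string]interface{}, condType string) (string, bool) {
	status, ok := obj["status"].(map[string]interface{})
	if !ok {
		return "", false
	}
	conds, ok := status["conditions"].([]interface{})
	if !ok {
		return "", false
	}
	for _, c := range conds {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue // skip malformed entries instead of failing
		}
		if t, _ := cond["type"].(string); t == condType {
			s, _ := cond["status"].(string)
			return s, true
		}
	}
	return "", false
}

func main() {
	// Hand-built stand-in for an ElasticCluster object; the "Ready"
	// condition type is an assumption for illustration.
	ec := map[string]interface{}{
		"apiVersion": "storage.deckhouse.io/v1alpha1",
		"kind":       "ElasticCluster",
		"status": map[string]interface{}{
			"conditions": []interface{}{
				map[string]interface{}{"type": "Ready", "status": "True"},
			},
		},
	}
	s, found := conditionStatus(ec, "Ready")
	fmt.Println(s, found)
}
```

A wait helper would presumably call a reader like this in a poll loop until the status is "True" or a timeout expires.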
Per-module modulePullOverride env override
- internal/config/overrides.go (+ tests): override a module's modulePullOverride at config load via a per-module env var (sds-elastic → SDS_ELASTIC_MODULE_PULL_OVERRIDE, csi-ceph → CSI_CEPH_MODULE_PULL_OVERRIDE; name upper-cased, non-[A-Z0-9] → _). Lets CI pin the module-under-test to a pr<N>/mr<N> image without editing the committed cluster_config.yml. A single global tag is intentionally avoided so multi-module configs stay unambiguous; each applied override is logged at load time.
NodeGroup create robustness
- pkg/kubernetes/nodegroup.go: wrap CreateStaticNodeGroup in a bounded retry (idempotent re-read each attempt). Right after dhctl bootstrap the node-manager validating webhook is often still unreachable, so a single-shot create raced it and failed the whole run with a transient "failed calling webhook ... connect: operation not permitted" / InternalError.
- internal/config/config.go: NodeGroupTimeout 2m → 4m (now a retry budget), SecretsWaitTimeout 2m → 10m (bootstrap secrets/webhook convergence routinely exceed 2m on slower/nested clusters).
Docs
- docs/{ARCHITECTURE,FUNCTIONS_GLOSSARY,WORKLOG}.md and README.md updated to reflect the new helper layer, the <MODULE>_MODULE_PULL_OVERRIDE contract, and the timeout/retry behavior.
Compatibility / notes
intended replacement is the Elastic CR layer above.
sds-elastic e2e suite) pins this branch as a pseudo-version until it is merged and tagged.
Test plan
- go build ./..., go vet ./... clean
- go test ./internal/config/... (override normalization / precedence) passes
- sds-elastic e2e suite (nested cluster: EC bootstrap, RBD/CephFS/HighRedundancy ESC round-trip, deletion guards, module-disable guard) runs green against this branch