PoC: cloud-checksum task hashing for the global cache #7154
Draft
jorgee wants to merge 2 commits into
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Summary
Experimental implementation of cloud-storage-native content checksums as the file-content contribution to the Nextflow task hash, layered on top of the existing global-cache prototype (NXF_GLOBALCACHE_PATH). Enables cross-bucket, cross-region task cache reuse without changing the workdir layout.
Builds on extending nf-cloudcache as global-cache work. Targets the global-cache-playground-concurrency branch.
Motivation
The existing prototype on global-cache-playground-concurrency zeroes sessionId and drops processName from the task hash, enabling cross-pipeline reuse, but file inputs are still hashed via HashMode.STANDARD (path + size + mtime). The same content at two different paths produces two different task hashes, so cross-bucket / cross-region cache reuse never materialises.
This PR replaces the file-content contribution with the cloud-storage-native checksum (S3 SHA256, Azure MD5), making the task hash truly content-addressable for cloud-resident files.
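To make the motivation concrete, here is a minimal sketch of the difference between a path-based and a checksum-based contribution. It is an illustration only: the class name, helper names, the `|` separator and the example bucket URIs are assumptions made for this sketch, not the actual Nextflow hashing code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class; none of these names exist in Nextflow.
class HashContributionDemo {

    // Hex-encoded SHA-256 of a string; a stand-in for the real task-hash funnel.
    static String sha256Hex(String s) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every JVM
        }
    }

    // STANDARD-style contribution: depends on where the file lives and when it was written.
    static String standardContribution(String path, long size, long mtime) {
        return sha256Hex(path + "|" + size + "|" + mtime);
    }

    // Checksum-style contribution: depends only on the algorithm and the stored digest.
    static String checksumContribution(String algo, String value) {
        return sha256Hex(algo + ":" + value);
    }

    public static void main(String[] args) {
        long size = 1024L, mtime = 1700000000000L;
        // Same content copied to two buckets (example URIs are made up).
        String a = standardContribution("s3://bucket-eu/data/sample.fq", size, mtime);
        String b = standardContribution("s3://bucket-us/data/sample.fq", size, mtime);
        System.out.println("STANDARD equal across buckets: " + a.equals(b));

        // The native checksum travels with the content, so both copies report the same value.
        String digest = "hypothetical-base64-digest";
        String c = checksumContribution("sha256", digest);
        String d = checksumContribution("sha256", digest);
        System.out.println("checksum equal across buckets: " + c.equals(d));
    }
}
```

The path-based contribution changes as soon as the bucket name changes, while the checksum-based one does not, which is the property the global cache needs.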
Design
Capability-interface pattern, mirroring the existing FileSystemTransferAware:
- nextflow.util.checksum.Checksum (nf-commons): value object {algo, value} with a canonical toHashContribution() rendering as "algo:value".
- nextflow.file.ChecksumAwareFileSystemProvider (nf-commons): capability interface implemented by cloud FileSystemProviders that can return a content-addressable checksum via a single HEAD-style metadata call.
- CacheHelper.HashMode.CLOUD_HASH: new enum value. Parsed via "cloud-hash" / "cloud_hash".
- FileHolder.funnel: when mode is CLOUD_HASH, routes to storePath (workdir-side staged location) rather than sourceObj, then delegates to HashBuilder.hashFile.
- HashBuilder.hashFile: handles CLOUD_HASH for regular files (single HEAD) and directories (walk + commutative sum, mirroring hashDirSha256). Files / dirs without a usable native checksum fall back to STANDARD path+size+mtime with a distinct WARN log per cause.
- TaskHasher: overrides the configured hash mode to CLOUD_HASH automatically when global-cache mode is active (session.uniqueId == 0000…0000).
Cloud-side implementations:
- S3FileSystemProvider (nf-amazon) implements headChecksum, reading checksumSHA256() from a HeadObject with ChecksumMode.ENABLED.
- AzFileSystemProvider (nf-azure) implements headChecksum, reading blobClient().getProperties().getContentMd5().
- S3OutputStream, S3Client, S3BashLib: every Nextflow-initiated S3 upload now requests ChecksumAlgorithm.SHA256 (centralised via S3Client.CHECKSUM_ALGORITH) so objects Nextflow writes are reachable by headChecksum. aws s3 cp in the task wrapper gains --checksum-algorithm SHA256. s5cmd is left alone (the binary doesn't accept the flag).
Supporting fixes:
- SerializationHelper: kryo.setRegistrationRequired(false) + plugin-contributed serializers attached via addDefaultSerializer instead of positional register(...). Cached TaskContext blobs are now plugin-set-independent (previously, switching the loaded plugin set between cache write and read shifted Kryo numeric class IDs and triggered "Encountered unregistered class ID: N").
- SerializationHelper.getPluginAwareClassLoader: Kryo classloader that delegates to pf4j PluginClassLoaders so name-encoded class lookups can resolve plugin classes (S3Path, AzPath, etc.).
- Session.init / ConfigBuilder: the bootstrap mkdirs of ${NXF_GLOBALCACHE_PATH}/cache/${sessionId} moves out of ConfigBuilder (too early for Azure: AzPathFactory.parseUri reads session.config.azure, which isn't set yet) into Session.init (after Global.setSession).
Known limitations
- S3 objects without an x-amz-checksum-* header (e.g. uploaded before SHA256 was the configured default, or uploaded via tools that don't request additional checksums) fall back to STANDARD path+size+mtime. Backfilling them requires a CopyObject --checksum-algorithm SHA256 pass for buckets you own, or the S3 Batch Operations "Compute checksum" job for read-only public buckets; neither is implemented here.
- Nextflow-initiated S3 uploads request SHA256 (S3OutputStream, aws s3 cp from S3BashLib), so objects Nextflow produces converge on one algorithm. Objects produced outside Nextflow (manual uploads, external tools) may carry a different additional-checksum algorithm and won't be reachable by headChecksum until backfilled.
- GCS is not covered: its provider is CloudStorageFileSystemProvider, which can't be extended with the capability interface directly. In practice this is currently not a limitation for the global cache: GCS inputs are foreign-scheme and get ported into the workdir by FilePorter before TaskHasher runs, so they participate in cloud-hash via their workdir-side copy under the workdir's native algorithm.
- If a foreign-scheme input were ever to bypass FilePorter, the file would not be reachable via the workdir's ChecksumAwareFileSystemProvider and would fall back to STANDARD. This is a forward-looking limitation rather than a present one.
- s5cmd cp does not accept --checksum-algorithm; deployments using s5cmd rely on whatever default that binary applies. Documented as a comment in AwsBatchExecutor.s3Cmd.
- Directory inputs (e.g. fastqc/*/, multiqc/) issue one HEAD per file at hash time. This is acceptable for an experiment and worth memoising on the FS provider if measurements call for it.
Reviewer notes
- Targets global-cache-playground-concurrency (not master): this is experimental, not for direct master merge.
- Each fallback cause in HashBuilder.lookupCloudChecksum emits a distinct WARN so it's easy to see in .nextflow.log why a particular input isn't getting cross-bucket hits.
- The change to SerializationHelper is broader than just the global cache (it affects all Kryo use in Nextflow). The behaviour is more permissive: previously-unregistered classes are now name-encoded rather than throwing. This needs careful review before any landing on master.
- S3OutputStream retains the existing Content-MD5 integrity header alongside the new checksumAlgorithm(SHA256). They're independent and complementary.
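For reviewers who want a mental model of the lookup-and-fallback path and the commutative directory sum described in the Design section, the following is a rough, self-contained sketch. It is not the code in this PR: the class name, the `ChecksumAware` interface shape, the `standard:` prefix, the WARN text, and the choice of summing SHA-256 values modulo 2^256 are all assumptions made for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch class; names and signatures are illustrative only.
class CloudHashSketch {

    // Value object {algo, value} rendered as "algo:value", as described in the Design section.
    static final class Checksum {
        final String algo, value;
        Checksum(String algo, String value) { this.algo = algo; this.value = value; }
        String toHashContribution() { return algo + ":" + value; }
    }

    // Assumed shape of the capability: one metadata call, empty when no native checksum exists.
    interface ChecksumAware {
        Optional<Checksum> headChecksum(String key);
    }

    // Use the native checksum when present; otherwise warn and fall back to path+size+mtime.
    static String fileContribution(ChecksumAware fs, String key, long size, long mtime) {
        Optional<Checksum> c = fs.headChecksum(key);
        if (c.isPresent()) return c.get().toHashContribution();
        System.err.println("WARN: no native checksum for " + key + ", falling back to STANDARD");
        return "standard:" + key + "|" + size + "|" + mtime;
    }

    static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every JVM
        }
    }

    // Order-independent directory hash: add per-entry digests as unsigned integers mod 2^256.
    static BigInteger commutativeSum(List<String> entryContributions) {
        BigInteger mod = BigInteger.ONE.shiftLeft(256);
        BigInteger sum = BigInteger.ZERO;
        for (String e : entryContributions) {
            sum = sum.add(new BigInteger(1, sha256(e))).mod(mod);
        }
        return sum;
    }

    public static void main(String[] args) {
        ChecksumAware withChecksum = key -> Optional.of(new Checksum("sha256", "hypothetical-digest"));
        ChecksumAware without = key -> Optional.empty();
        System.out.println(fileContribution(withChecksum, "s3://bucket/a.txt", 10L, 20L));
        System.out.println(fileContribution(without, "s3://bucket/a.txt", 10L, 20L));

        // Listing order of a directory walk must not change the result.
        BigInteger x = commutativeSum(Arrays.asList("a.txt sha256:111", "b.txt sha256:222"));
        BigInteger y = commutativeSum(Arrays.asList("b.txt sha256:222", "a.txt sha256:111"));
        System.out.println("order-independent: " + x.equals(y));
    }
}
```

A sum is used here because it is commutative, so the result does not depend on the order in which a cloud listing returns directory entries; each entry string includes a relative name so that renaming a file inside the directory still changes the hash.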