PoC: cloud-checksum task hashing for the global cache #7154

Draft
jorgee wants to merge 2 commits into global-cache-playground-concurrency from global-cache-playground-concurrency-cloud-checksum

jorgee commented May 18, 2026

Summary

Experimental implementation of cloud-storage-native content checksums as the file-content contribution to the Nextflow task hash, layered on top of the existing global-cache prototype (NXF_GLOBALCACHE_PATH). Enables cross-bucket, cross-region task cache reuse without changing the workdir layout.

Builds on the "extending nf-cloudcache as global-cache" work. Targets the global-cache-playground-concurrency branch.

Motivation

The existing prototype on global-cache-playground-concurrency zeroes sessionId and drops processName from the task hash, enabling cross-pipeline reuse — but file inputs are still hashed via HashMode.STANDARD (path + size + mtime). The same content at two different paths produces two different task hashes, so cross-bucket / cross-region cache reuse never materialises.

This PR replaces the file-content contribution with the cloud-storage-native checksum (S3 SHA256, Azure MD5), making the task hash truly content-addressable for cloud-resident files.

Design

Capability-interface pattern, mirroring the existing FileSystemTransferAware:

  • nextflow.util.checksum.Checksum (nf-commons) — value object {algo, value} with a canonical toHashContribution() rendering as "algo:value".
  • nextflow.file.ChecksumAwareFileSystemProvider (nf-commons) — capability interface implemented by cloud FileSystemProviders that can return a content-addressable checksum via a single HEAD-style metadata call.
  • CacheHelper.HashMode.CLOUD_HASH — new enum value. Parsed via "cloud-hash" / "cloud_hash".
  • FileHolder.funnel — when mode is CLOUD_HASH, routes to storePath (workdir-side staged location) rather than sourceObj, then delegates to HashBuilder.hashFile.
  • HashBuilder.hashFile — handles CLOUD_HASH for regular files (single HEAD) and directories (walk + commutative sum, mirroring hashDirSha256). Files / dirs without a usable native checksum fall back to STANDARD path+size+mtime with a distinct WARN log per cause.
  • TaskHasher — overrides the configured hash mode to CLOUD_HASH automatically when global-cache mode is active (session.uniqueId == 0000…0000).
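
For reviewers who want the shape of the design in one place, here is a minimal sketch of the value object, the capability interface and the fall-back decision described above. The type names follow the bullets, but the method signatures (a `String` path, an `Optional` return), the `FileMeta` stand-in and the `path|size|mtime` rendering are assumptions made for illustration, not the code in this PR.

```java
import java.util.Optional;

/** Illustration only: names follow the PR text, signatures and bodies are assumed. */
final class CloudHashSketch {

    /** Value object {algo, value}, rendered canonically as "algo:value". */
    static final class Checksum {
        final String algo;
        final String value;
        Checksum(String algo, String value) { this.algo = algo; this.value = value; }
        String toHashContribution() { return algo + ":" + value; }
    }

    /** Capability interface; the String-path signature is a simplification. */
    interface ChecksumAware {
        Optional<Checksum> headChecksum(String path);
    }

    /** Hypothetical stand-in for the attributes that STANDARD mode hashes. */
    static final class FileMeta {
        final String path;
        final long size;
        final long mtime;
        FileMeta(String path, long size, long mtime) {
            this.path = path; this.size = size; this.mtime = mtime;
        }
    }

    /** STANDARD-style contribution: depends on where the file lives. */
    static String standard(FileMeta f) {
        return f.path + "|" + f.size + "|" + f.mtime;
    }

    /** CLOUD_HASH-style contribution with a distinct warning per fall-back cause. */
    static String fileContribution(Object provider, FileMeta f) {
        if (!(provider instanceof ChecksumAware)) {
            System.err.println("WARN provider is not checksum-aware, using STANDARD: " + f.path);
            return standard(f);
        }
        Optional<Checksum> sum = ((ChecksumAware) provider).headChecksum(f.path);
        if (!sum.isPresent()) {
            System.err.println("WARN no native checksum on object, using STANDARD: " + f.path);
            return standard(f);
        }
        // Path, size and mtime are ignored here, so identical content hashes identically.
        return sum.get().toHashContribution();
    }
}
```

With a checksum-aware provider, two objects with the same content in different buckets yield the same contribution. With any other provider the result still depends on path, size and mtime.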

Cloud-side implementations:

  • S3FileSystemProvider (nf-amazon) implements headChecksum, reading checksumSHA256() from a HeadObject with ChecksumMode.ENABLED.
  • AzFileSystemProvider (nf-azure) implements headChecksum, reading blobClient().getProperties().getContentMd5().
  • S3OutputStream, S3Client, S3BashLib — every Nextflow-initiated S3 upload now requests ChecksumAlgorithm.SHA256 (centralised via S3Client.CHECKSUM_ALGORITH) so objects Nextflow writes are reachable by headChecksum. aws s3 cp in the task wrapper gains --checksum-algorithm SHA256. s5cmd is left alone (the binary doesn't accept the flag).
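
To sanity-check what the two providers report, it can help to reproduce the values locally. S3 returns additional checksums as the Base64 encoding of the raw digest, and Azure's Content-MD5 is likewise a Base64-encoded MD5 on the wire (the SDK hands it back as a byte array). The sketch below computes both from local bytes using only the JDK. The `sha256:` / `md5:` prefixes are an assumed rendering of the `algo` field, and the SHA-256 value only matches S3 for single-part uploads, since multipart uploads report a composite checksum.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Illustration only: reproduces the Base64 digest form used by cloud object metadata. */
final class NativeChecksumFormat {

    /** Digest the content with the given JCA algorithm and Base64-encode the raw bytes. */
    static String base64Digest(String algorithm, byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(content);
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 and MD5 are required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        // "sha256" and "md5" are assumed algo labels, not taken from the PR code.
        System.out.println("sha256:" + base64Digest("SHA-256", content));
        System.out.println("md5:" + base64Digest("MD5", content));
    }
}
```

Comparing this output with the metadata of a small test object is a quick way to confirm that an upload path actually attached the checksum.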

Supporting fixes:

  • SerializationHelper: kryo.setRegistrationRequired(false) + plugin-contributed serializers attached via addDefaultSerializer instead of positional register(...). Cached TaskContext blobs are now plugin-set-independent (previously, switching the loaded plugin set between cache write and read shifted Kryo numeric class IDs and triggered Encountered unregistered class ID: N).
  • SerializationHelper.getPluginAwareClassLoader — Kryo classloader that delegates to pf4j PluginClassLoaders so name-encoded class lookups can resolve plugin classes (S3Path, AzPath, etc.).
  • Session.init / ConfigBuilder — the bootstrap mkdirs of ${NXF_GLOBALCACHE_PATH}/cache/${sessionId} moves out of ConfigBuilder (too early for Azure: AzPathFactory.parseUri reads session.config.azure, which isn't set yet) into Session.init (after Global.setSession).
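
The class-ID shift behind the SerializationHelper change is easy to reproduce with a toy model. The sketch below is not Kryo: it only mimics the idea that positional registration hands out IDs in registration order, so a reader with a different plugin set maps the same ID to a different class or to none at all. The class names, the base ID of 10 and both helper methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of positional class IDs; not Kryo's actual implementation. */
final class RegistrationIdSketch {

    static final int FIRST_ID = 10; // arbitrary base, chosen for the illustration

    /** Writer side: the ID is the position of the class in the registration order. */
    static int positionalId(List<String> registry, String className) {
        int idx = registry.indexOf(className);
        if (idx < 0) throw new IllegalArgumentException("not registered: " + className);
        return FIRST_ID + idx;
    }

    /** Reader side: the ID is resolved against the reader's own registration order. */
    static String resolve(List<String> registry, int id) {
        int idx = id - FIRST_ID;
        if (idx < 0 || idx >= registry.size()) {
            throw new IllegalStateException("Encountered unregistered class ID: " + id);
        }
        return registry.get(idx);
    }

    public static void main(String[] args) {
        // Cache written with both cloud plugins loaded, read back with only one.
        List<String> writer = Arrays.asList("core.TaskBean", "plugin.S3Path", "plugin.AzPath");
        List<String> reader = Arrays.asList("core.TaskBean", "plugin.AzPath");

        int s3Id = positionalId(writer, "plugin.S3Path");
        // The reader silently resolves the S3Path ID to a different class.
        System.out.println(s3Id + " -> " + resolve(reader, s3Id));

        int azId = positionalId(writer, "plugin.AzPath");
        try {
            resolve(reader, azId);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        // Name encoding writes "plugin.S3Path" itself, so no ordering is involved.
    }
}
```

Writing the class name instead of a position removes the dependency on registration order, at the cost of larger blobs and of accepting classes that were never registered.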

Known limitations

  • Pre-existing S3 objects without an x-amz-checksum-* header (e.g. uploaded before SHA256 was the configured default, or uploaded via tools that don't request additional checksums) fall back to STANDARD path+size+mtime. Backfilling them requires a CopyObject --checksum-algorithm SHA256 pass for buckets you own, or the S3 Batch Operations "Compute checksum" job for read-only public buckets — not implemented here.
  • Default S3 checksum varies by API surface. The AWS console uses CRC64NVMe by default, the AWS CLI defaults to CRC32, and other SDKs vary. Nextflow now forces SHA256 on every write it controls (SDK uploads, S3OutputStream, aws s3 cp from S3BashLib), so objects Nextflow produces converge on one algorithm. Objects produced outside Nextflow (manual uploads, external tools) may carry a different additional-checksum algorithm and won't be reachable by headChecksum until backfilled.
  • GCS support is deferred. GCS uses Google's CloudStorageFileSystemProvider, which can't be extended with the capability interface directly. In practice this is currently not a limitation for the global cache: GCS inputs are foreign-scheme and get ported into the workdir by FilePorter before TaskHasher runs, so they participate in cloud-hash via their workdir-side copy under the workdir's native algorithm.
  • Multi-provider Fusion / direct-cloud-access scenarios — if Fusion or similar mechanisms let a task read directly from a different cloud without staging through FilePorter, the file would not be reachable via the workdir's ChecksumAwareFileSystemProvider and would fall back to STANDARD. This is a forward-looking limitation rather than a present one.
  • s5cmd cp does not accept --checksum-algorithm — deployments using s5cmd rely on whatever default that binary applies. Documented as a comment in AwsBatchExecutor.s3Cmd.
  • Directory inputs are walked and per-file hashes are summed commutatively. Wide directories (fastqc/*/, multiqc/) issue one HEAD per file at hash time — acceptable for an experiment, worth memoising on the FS provider if measurements call for it.
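
For the directory case, "summed commutatively" means the combined value must not depend on the order in which the walk visits files. A minimal sketch of one such combiner follows: each entry is digested on its own and the digests are added byte-wise, so any visiting order gives the same result. Whether hashDirSha256 folds in the relative path the same way is not stated here, so the `relativePath + "\n" + checksum` entry encoding is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

/** Illustration only: an order-independent combiner for per-file checksums. */
final class CommutativeDirHash {

    static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Combine relative-path to checksum entries into one hex string. */
    static String combine(Map<String, String> checksumsByRelativePath) {
        byte[] acc = new byte[32];
        for (Map.Entry<String, String> e : checksumsByRelativePath.entrySet()) {
            // Assumed entry encoding: the relative path is included so renames change the hash.
            byte[] h = sha256(e.getKey() + "\n" + e.getValue());
            for (int i = 0; i < acc.length; i++) {
                acc[i] += h[i]; // byte-wise addition mod 256 is commutative and associative
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : acc) hex.append(String.format("%02x", b & 0xff));
        return hex.toString();
    }
}
```

Because the per-entry work is independent, memoising the HEAD results per object on the provider would not change the combined value.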

Reviewer notes

  • Targets global-cache-playground-concurrency (not master) — this is experimental, not for direct master merge.
  • Each fall-through case in HashBuilder.lookupCloudChecksum emits a distinct WARN so it's easy to see in .nextflow.log why a particular input isn't getting cross-bucket hits.
  • The change to SerializationHelper is broader than just the global cache (affects all Kryo use in Nextflow). The behaviour is more permissive: previously-unregistered classes are now name-encoded rather than throwing. This needs careful review before any of it lands on master.
  • S3OutputStream retains the existing Content-MD5 integrity header alongside the new checksumAlgorithm(SHA256). They're independent and complementary.

jorgee added 2 commits May 15, 2026 16:53
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
jorgee changed the title from "Experiment: cloud-checksum task hashing for the global cache" to "PoC: cloud-checksum task hashing for the global cache" May 18, 2026