feat(deploy): set_replica_identity for PostgreSQL CDC (#1447)#1466
feat(deploy): set_replica_identity for PostgreSQL CDC (#1447)#1466dimitri-yatsenko wants to merge 2 commits into
Conversation
Adds dj.deploy.set_replica_identity(target, mode, dry_run) — an idempotent
deployment-time operation that applies ALTER TABLE ... REPLICA IDENTITY
DEFAULT|FULL across a Schema or single Table on PostgreSQL.
Required by some CDC consumers (Databricks Lakehouse Sync mandates FULL;
silently skips tables without it). The ALTER is metadata-only and instant;
the cost is in WAL volume on subsequent UPDATE/DELETE.
Why a new module rather than dj.migrate or auto-emit in declare():
- Not a migration. Nothing legacy is being fixed; this configures an
existing schema for an environment-specific consumer requirement.
- Not a declare-time concern. Mixing auto-emit-on-declare with a separate
utility for existing tables produces mixed state (new tables FULL, old
tables DEFAULT) until both run. Migration-only is one coherent path.
- Future operational helpers (publication membership, vacuum/reindex,
grants) belong with set_replica_identity, not in migrate.py. dj.deploy
is the home for that category.
Shape:
- adapters/postgres.py — replica_identity_ddl(full_name, mode) emits the
ALTER TABLE statement. Pure DDL emitter; "default" and "full" both
produce explicit ALTERs. Invalid mode raises DataJointError.
- deploy.py (new) — set_replica_identity dispatches Schema vs Table,
routes through the PG adapter, raises on non-PG backends, supports
dry_run, returns {tables_analyzed, tables_modified, ddl}.
- __init__.py — exports dj.deploy.
Tests:
- Unit tests for the adapter DDL (full, default, invalid mode).
- Unit tests for deploy.set_replica_identity (mode validation, target
validation, non-PG rejection, dry_run, apply, default mode, empty
schema). All use stub adapters so no live PG is required.
Slated for DataJoint 2.3.
MilagrosMarin
left a comment
There was a problem hiding this comment.
Read this carefully against dj.deploy.set_replica_identity and the spec in datajoint-docs#180. Mechanics are clean, the dispatch ladder handles all three target shapes (Schema, Table class, Table instance), and the duck-typed hasattr(adapter, "replica_identity_ddl") PG check is a friendly contract for third-party PG-compatible adapters. A few observations, mostly suggestions — none are large fixes, but two of them feel worth catching before this lands.
Suggestions worth addressing
1. assert statements in the dispatch ladder might bite under -O. Four places in deploy.py (lines 123, 125, 130, 135):
assert connection is not None, "Schema has no active connection"
assert target.database is not None, "Schema is not activated"
assert connection is not None, "Table has no active connection" # ×2The project's CLAUDE.md flags this as a pattern to avoid — Python's -O flag strips asserts at bytecode-compile time. In a deploy-hook context that's not unusual (some Docker base images run -O; some CI orchestrators do). If -O strips these, the user gets AttributeError: 'NoneType' object has no attribute 'adapter' on the next line instead of the intended DataJointError. Lightly suggest converting to explicit if … raise DataJointError(…) — the messages are already there, just need the runtime guarantee. The Schema database is None case is the most concerning since it would continue with None as a schema name and produce a malformed ALTER TABLE None ....
2. The two Table-target branches have no test coverage. All seven tests in tests/unit/test_deploy.py build a _FakeSchema; the Table-class branch (deploy.py:127-132) and Table-instance branch (deploy.py:133-137) are never exercised. They use target.full_table_name (a different code path from the Schema branch's adapter.make_full_table_name(...)), so a regression in Table.full_table_name semantics — for example, the database is None case returning None — wouldn't surface here. A _FakeTable test for each branch with the existing _FakeAdapter would close the gap cheaply.
Lighter polish — feel free to defer
3. Test count in the PR description says "16 new unit tests" — I count 7 in test_deploy.py + 3 new ones in test_adapters.py = 10. Trivial fix; mostly mentioning it so reviewers reading the description aren't searching for tests that aren't there.
4. Type hints are looser than the spec. Spec in datajoint-docs#180 documents the signature as target: Schema | Table, mode: Literal["default", "full"]. Impl uses target: Any, mode: str — runtime validation does the same work, but the tighter types would help IDEs and type-checkers. Literal["default", "full"] is straightforward; Schema | Table would need a forward reference for the circular import.
5. mode="FULL" (uppercase) is rejected on a strict equality. The PG adapter does mode.upper() internally so it's case-tolerant; the deploy function isn't. Could be mode.lower() not in (...) for symmetry — minor UX nit, not worth a separate PR.
6. Worth one line on lock behavior in the docstring's Cost section. ALTER TABLE ... REPLICA IDENTITY takes an AccessExclusiveLock for the brief metadata flip. On an actively-ingesting table, that can stack behind in-flight transactions and block briefly — a real consideration for deploy-hook timing on a busy production table. Something like "requires a brief exclusive lock — will block behind in-flight writes" would round out the existing Cost framing.
7. Partial-failure semantics on multi-table runs aren't explicit. If connection.query(ddl) raises on table N of M, the first N-1 tables are already modified, but the caller doesn't see the dict — the exception propagates. Re-running brings the rest into compliance (idempotent), but a one-sentence note in the docstring would make that explicit. Not load-bearing.
8. Minor style nit: line 143 has adjacent-string concatenation (f"... adapter " "does not ...") — either one f-string or one continuous string would read more naturally. Cosmetic only.
Verified ✅ (no action needed)
- DDL format matches
replica_identity_ddl()adapter method exactly (ALTER TABLE {full_name} REPLICA IDENTITY {MODE}) Schema.list_tables()already excludes~-prefixed hidden tables and~~job tables (the CDC-relevant filter)dj.Schemainherits from_Schema—isinstance(target, _Schema)correctly catches the public API- Schema dispatch (
adapter.make_full_table_name(...)) and Table dispatch (Table.full_table_nameproperty, which internally calls the same adapter method) produce equivalent quoting dj.deploycorrectly exported via__init__.py("deploy"in__all__+from . import deploy)hasattr(adapter, "replica_identity_ddl")is a defensible PG-detection strategy that lets third-party PG-compatible adapters opt in- Both
defaultandfullemit explicit ALTERs — correct semantics for reverting fromFULL - The
dj.migratesibling functions cited in the module docstring (migrate_columns,add_job_metadata_columns,rebuild_lineage) all exist inmigrate.py - Error messages match datajoint-docs#180's spec verbatim
dry_run=Truedefault is a thoughtful safety choice
Marking as request-changes because of items #1 and #2 — both are small mechanical fixes but they're the kind of thing better caught now than in a 2.3.1 patch.
Summary
Adds
dj.deploy.set_replica_identity(target, mode, dry_run)— an idempotent deployment-time operation that appliesALTER TABLE ... REPLICA IDENTITY DEFAULT|FULLacross a Schema or single Table on PostgreSQL.Closes #1447. Slated for DataJoint 2.3.
Why
dj.deployand notdj.migrateor auto-emit indeclare()The original issue proposed two mechanisms — a config flag applied automatically during
declare(), plus a utility for existing tables. After implementing both, we reverted to a single migration-only path. Reasons:dj.migrateis for one-shot schema/state evolution; this is operational configuration.FULL, old tablesDEFAULT) until both run. One mechanism is coherent.set_replica_identityis the first of a category: publication membership, vacuum/reindex, role grants, table-level autovacuum params. These belong together, not inmigrate.py. The newdj.deploymodule is the right home.What's in the PR
src/datajoint/adapters/postgres.pyreplica_identity_ddl(full_name, mode)— pure DDL emitter; bothdefaultandfullproduce explicit ALTERs; invalid mode raises.src/datajoint/deploy.py(new)set_replica_identity(target, mode, dry_run)— dispatches Schema/Table, validates backend, supports dry_run, returns{tables_analyzed, tables_modified, ddl}.src/datajoint/__init__.pydj.deploy.tests/unit/test_adapters.pytests/unit/test_deploy.py(new)User-facing API
Non-PostgreSQL backends raise
DataJointError(loud failure rather than silent no-op — consistent with this being an opt-in operational tool).Test plan
tests/unit/test_deploy.py+ 3 new tests intests/unit/test_adapters.py)tests/unit/test_settings.py(80 tests) andtests/integration/test_declare.py(44 tests) green — no regressions from the revertdatajoint-docs(in flight)