Skip to content

feat(deploy): set_replica_identity for PostgreSQL CDC (#1447)#1466

Open
dimitri-yatsenko wants to merge 2 commits into
masterfrom
feat/postgres-replica-identity
Open

feat(deploy): set_replica_identity for PostgreSQL CDC (#1447)#1466
dimitri-yatsenko wants to merge 2 commits into
masterfrom
feat/postgres-replica-identity

Conversation

@dimitri-yatsenko

Copy link
Copy Markdown
Member

Summary

Adds dj.deploy.set_replica_identity(target, mode, dry_run) — an idempotent deployment-time operation that applies ALTER TABLE ... REPLICA IDENTITY DEFAULT|FULL across a Schema or single Table on PostgreSQL.

Closes #1447. Slated for DataJoint 2.3.

Why dj.deploy and not dj.migrate or auto-emit in declare()

The original issue proposed two mechanisms — a config flag applied automatically during declare(), plus a utility for existing tables. After implementing both, we reverted to a single migration-only path. Reasons:

  1. Not a migration. Nothing legacy is being fixed. The setting is a property of the deployment environment, and a fresh declare in a new environment may need it re-applied. dj.migrate is for one-shot schema/state evolution; this is operational configuration.
  2. Not a declare-time concern. Mixing auto-emit-on-declare with a separate utility produces mixed state (new tables FULL, old tables DEFAULT) until both run. One mechanism is coherent.
  3. A new home for operational helpers. set_replica_identity is the first of a category: publication membership, vacuum/reindex, role grants, table-level autovacuum params. These belong together, not in migrate.py. The new dj.deploy module is the right home.

What's in the PR

File Change
src/datajoint/adapters/postgres.py New replica_identity_ddl(full_name, mode) — pure DDL emitter; both default and full produce explicit ALTERs; invalid mode raises.
src/datajoint/deploy.py (new) set_replica_identity(target, mode, dry_run) — dispatches Schema/Table, validates backend, supports dry_run, returns {tables_analyzed, tables_modified, ddl}.
src/datajoint/__init__.py Exports dj.deploy.
tests/unit/test_adapters.py Three new tests for the PG adapter DDL (full / default / invalid mode).
tests/unit/test_deploy.py (new) Seven unit tests covering mode/target/backend validation, dry_run, apply, default mode, empty schema. Uses stub adapters — no live PG container required.

User-facing API

from datajoint import deploy

# Preview what would change
deploy.set_replica_identity(my_schema, mode="full", dry_run=True)
# {'tables_analyzed': 12, 'tables_modified': 0, 'ddl': ['ALTER TABLE "ms"."t1" ...', ...]}

# Apply
deploy.set_replica_identity(my_schema, mode="full", dry_run=False)

# Single table also supported
deploy.set_replica_identity(MyTable, mode="full", dry_run=False)

Non-PostgreSQL backends raise DataJointError (loud failure rather than silent no-op — consistent with this being an opt-in operational tool).

Test plan

  • All 16 new unit tests pass (tests/unit/test_deploy.py + 3 new tests in tests/unit/test_adapters.py)
  • tests/unit/test_settings.py (80 tests) and tests/integration/test_declare.py (44 tests) green — no regressions from the revert
  • CI green on this PR
  • Integration test against a real PG container (deferred — exercised manually before tag)
  • Companion docs PR in datajoint-docs (in flight)

Adds dj.deploy.set_replica_identity(target, mode, dry_run) — an idempotent
deployment-time operation that applies ALTER TABLE ... REPLICA IDENTITY
DEFAULT|FULL across a Schema or single Table on PostgreSQL.

Required by some CDC consumers (Databricks Lakehouse Sync mandates FULL;
silently skips tables without it). The ALTER is metadata-only and instant;
the cost is in WAL volume on subsequent UPDATE/DELETE.

Why a new module rather than dj.migrate or auto-emit in declare():

- Not a migration. Nothing legacy is being fixed; this configures an
  existing schema for an environment-specific consumer requirement.
- Not a declare-time concern. Mixing auto-emit-on-declare with a separate
  utility for existing tables produces mixed state (new tables FULL, old
  tables DEFAULT) until both run. Migration-only is one coherent path.
- Future operational helpers (publication membership, vacuum/reindex,
  grants) belong with set_replica_identity, not in migrate.py. dj.deploy
  is the home for that category.

Shape:
- adapters/postgres.py — replica_identity_ddl(full_name, mode) emits the
  ALTER TABLE statement. Pure DDL emitter; "default" and "full" both
  produce explicit ALTERs. Invalid mode raises DataJointError.
- deploy.py (new) — set_replica_identity dispatches Schema vs Table,
  routes through the PG adapter, raises on non-PG backends, supports
  dry_run, returns {tables_analyzed, tables_modified, ddl}.
- __init__.py — exports dj.deploy.

Tests:
- Unit tests for the adapter DDL (full, default, invalid mode).
- Unit tests for deploy.set_replica_identity (mode validation, target
  validation, non-PG rejection, dry_run, apply, default mode, empty
  schema). All use stub adapters so no live PG is required.

Slated for DataJoint 2.3.

@MilagrosMarin MilagrosMarin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read this carefully against dj.deploy.set_replica_identity and the spec in datajoint-docs#180. Mechanics are clean, the dispatch ladder handles all three target shapes (Schema, Table class, Table instance), and the duck-typed hasattr(adapter, "replica_identity_ddl") PG check is a friendly contract for third-party PG-compatible adapters. A few observations, mostly suggestions — none are large fixes, but two of them feel worth catching before this lands.

Suggestions worth addressing

1. assert statements in the dispatch ladder might bite under -O. Four places in deploy.py (lines 123, 125, 130, 135):

assert connection is not None, "Schema has no active connection"
assert target.database is not None, "Schema is not activated"
assert connection is not None, "Table has no active connection"  # ×2

The project's CLAUDE.md flags this as a pattern to avoid — Python's -O flag strips asserts at bytecode-compile time. In a deploy-hook context that's not unusual (some Docker base images run -O; some CI orchestrators do). If -O strips these, the user gets AttributeError: 'NoneType' object has no attribute 'adapter' on the next line instead of the intended DataJointError. Lightly suggest converting to explicit if … raise DataJointError(…) — the messages are already there, just need the runtime guarantee. The Schema database is None case is the most concerning since it would continue with None as a schema name and produce a malformed ALTER TABLE None ....

2. The two Table-target branches have no test coverage. All seven tests in tests/unit/test_deploy.py build a _FakeSchema; the Table-class branch (deploy.py:127-132) and Table-instance branch (deploy.py:133-137) are never exercised. They use target.full_table_name (a different code path from the Schema branch's adapter.make_full_table_name(...)), so a regression in Table.full_table_name semantics — for example, the database is None case returning None — wouldn't surface here. A _FakeTable test for each branch with the existing _FakeAdapter would close the gap cheaply.

Lighter polish — feel free to defer

3. Test count in the PR description says "16 new unit tests" — I count 7 in test_deploy.py + 3 new ones in test_adapters.py = 10. Trivial fix; mostly mentioning it so reviewers reading the description aren't searching for tests that aren't there.

4. Type hints are looser than the spec. Spec in datajoint-docs#180 documents the signature as target: Schema | Table, mode: Literal["default", "full"]. Impl uses target: Any, mode: str — runtime validation does the same work, but the tighter types would help IDEs and type-checkers. Literal["default", "full"] is straightforward; Schema | Table would need a forward reference for the circular import.

5. mode="FULL" (uppercase) is rejected on a strict equality. The PG adapter does mode.upper() internally so it's case-tolerant; the deploy function isn't. Could be mode.lower() not in (...) for symmetry — minor UX nit, not worth a separate PR.

6. Worth one line on lock behavior in the docstring's Cost section. ALTER TABLE ... REPLICA IDENTITY takes an AccessExclusiveLock for the brief metadata flip. On an actively-ingesting table, that can stack behind in-flight transactions and block briefly — a real consideration for deploy-hook timing on a busy production table. Something like "requires a brief exclusive lock — will block behind in-flight writes" would round out the existing Cost framing.

7. Partial-failure semantics on multi-table runs aren't explicit. If connection.query(ddl) raises on table N of M, the first N-1 tables are already modified, but the caller doesn't see the dict — the exception propagates. Re-running brings the rest into compliance (idempotent), but a one-sentence note in the docstring would make that explicit. Not load-bearing.

8. Minor style nit: line 143 has adjacent-string concatenation (f"... adapter " "does not ...") — either one f-string or one continuous string would read more naturally. Cosmetic only.

Verified ✅ (no action needed)

  • DDL format matches replica_identity_ddl() adapter method exactly (ALTER TABLE {full_name} REPLICA IDENTITY {MODE})
  • Schema.list_tables() already excludes ~-prefixed hidden tables and ~~ job tables (the CDC-relevant filter)
  • dj.Schema inherits from _Schemaisinstance(target, _Schema) correctly catches the public API
  • Schema dispatch (adapter.make_full_table_name(...)) and Table dispatch (Table.full_table_name property, which internally calls the same adapter method) produce equivalent quoting
  • dj.deploy correctly exported via __init__.py ("deploy" in __all__ + from . import deploy)
  • hasattr(adapter, "replica_identity_ddl") is a defensible PG-detection strategy that lets third-party PG-compatible adapters opt in
  • Both default and full emit explicit ALTERs — correct semantics for reverting from FULL
  • The dj.migrate sibling functions cited in the module docstring (migrate_columns, add_job_metadata_columns, rebuild_lineage) all exist in migrate.py
  • Error messages match datajoint-docs#180's spec verbatim
  • dry_run=True default is a thoughtful safety choice

Marking as request-changes because of items #1 and #2 — both are small mechanical fixes but they're the kind of thing better caught now than in a 2.3.1 patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add config option for REPLICA IDENTITY FULL on PostgreSQL tables

2 participants