feat: serialize tenant-owned writes against tenant data purges by xernobyl · Pull Request #87 · formbricks/hub

xernobyl · 2026-06-12T16:42:06Z

What does this PR do?

Adds a deliberate tenant-write serialization layer so the tenant data purge
(DELETE /v1/tenants/{tenant_id}/data) can no longer race concurrent
tenant-owned writes for the same tenant. Previously the purge deleted in one
transaction but did not serialize against in-flight writes, so a concurrent
create/update could commit mid-purge and survive it.

Closes ENG-1013

Mechanism — Postgres transaction-scoped advisory locks, single source of
truth in internal/repository/tenant_write_lock.go:

Every tenant-owned mutation (feedback records, webhooks, embeddings, taxonomy)
takes a shared lock on hashtextextended('tenant_write|<len>:<tenant_id>', 0).
Shared mode keeps same-tenant writers fully concurrent.
The purge takes the same key exclusively, waiting up to
TENANT_PURGE_LOCK_TIMEOUT_SECONDS (default 5) for in-flight writers to drain.
Writers use try-locks: the moment a purge is queued, new same-tenant writes
fail fast; other tenants are unaffected. Both directions return a retryable
409 tenant_write_conflict (new RFC 9457 problem code, distinct from the
terminal duplicate-record conflict).
ID-based mutations resolve the row's tenant inside the transaction and mutate
with a tenant-scoped WHERE; tenant-changing webhook updates lock both old and
new tenant; the GDPR cross-tenant delete locks every spanned tenant.
Creates (the hot path) gate the insert on the lock in a single autocommit
statement (INSERT ... SELECT ... WHERE pg_try_advisory_xact_lock_shared(...)),
so they serialize in one round trip with the same isolation.

No schema migration (advisory locks are not schema). Since the issue was filed,
taxonomy added six tenant-owned tables; those write paths are covered too.

Scope grew to cover review feedback: deduped the resolve+lock sequence and the
tx/interface helpers, and floored the purge lock_timeout to ≥1ms so it can
never disable the timeout.

API behavior change — affected endpoints now document a retryable 409. Example body:

{
  "type": "https://hub.formbricks.com/problems/tenant-write-conflict",
  "title": "Conflict",
  "status": 409,
  "detail": "tenant data purge in progress for this tenant; retry later",
  "instance": "/v1/feedback-records",
  "code": "tenant_write_conflict",
  "request_id": "019ebcb4-c817-7582-8017-cbc52258e368"
}

Accepted, documented tails: webhook deliveries already in flight at purge commit
complete; already-enqueued dispatch jobs no-op post-purge but their payloads
persist until River retention pruning.

How should this be tested?

Config: integration tests need a Postgres+pgvector test_db (the API serves on
PORT, tests use DATABASE_URL, API_KEY).

make dev-setup           # or: docker compose up -d postgres && make init-db && make river-migrate
make build
make fmt && make lint    # 0 issues
make test-unit           # no DB
make tests               # integration, real Postgres (CI repeats on pg16/17/18)
make tests-coverage      # 54.8% (threshold 15%)

ENG-1013 integration coverage in tests/tenant_write_lock_test.go — each holds
the advisory lock directly on a raw connection to make the race deterministic:

TestTenantPurgeWaitsForInFlightWritesThenConflicts — purge waits then 409s; succeeds after release
TestTenantWritesRejectedDuringPurge — every same-tenant write → 409 (incl. gated creates); other tenants → 201; no events published for rejected writes
TestTenantWritesFailFastWhilePurgeQueued — pg_locks-verified: new write rejected while purge is queued; queued purge completes after writers drain
TestGDPRDeleteByUserDuringPurge — all-or-nothing 409 across tenants; tenant-scoped delete proceeds
TestWebhookTenantMoveLocksTargetTenant, TestRepositoryWritesConflictDuringPurge

API contract: build, run, and make schemathesis (or the API-contract workflow)
— status_code_conformance / response_schema_conformance pass with the new 409s
and tenant_write_conflict enum member.

Checklist

Required

Appreciated

If API changed: added/updated OpenAPI spec and ran contract tests
If API behavior changed: added request/response examples to this PR
Updated docs in docs/ if changes were necessary — N/A (endpoint docs are in openapi.yaml)
Ran make tests-coverage for meaningful logic changes

Tenant data purges now take an exclusive transaction-scoped advisory lock per tenant while every tenant-owned mutation (feedback records, webhooks, embeddings, taxonomy) holds the same key in shared try-lock mode inside a single repository-level helper. Writes for a tenant under purge fail fast with a retryable 409 (code tenant_write_conflict); the purge waits up to TENANT_PURGE_LOCK_TIMEOUT_SECONDS (default 5s) for in-flight writes to drain and then returns the same retryable conflict. Other tenants are unaffected. ID-based mutations now resolve the row's tenant inside the transaction and mutate with tenant-scoped WHERE clauses; the GDPR delete-by-user locks every spanned tenant; tenant-changing webhook updates lock both tenants. Workers skip or retry cleanly on conflicts. Closes ENG-1013 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-12T16:42:42Z

✱ Stainless preview builds

This PR will update the hub SDKs with the following commit message.

feat: serialize tenant-owned writes against tenant data purges (ENG-1013)

✅ hub-openapi studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅

✅ hub-typescript studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅ → build ✅ → lint ✅ → test ✅

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-06-16 11:25:37 UTC

coderabbitai · 2026-06-12T16:56:22Z

Walkthrough

This PR introduces tenant-level write serialization using Postgres transaction-scoped advisory locks to prevent race conditions between tenant data purges and concurrent mutations. A new TenantWriteConflictError type maps to HTTP 409 Conflict responses. The configuration system parses TENANT_PURGE_LOCK_TIMEOUT_SECONDS (defaulting to 5 seconds). A core withTenantWriteTx helper wraps mutations in tenant-scoped transactions that acquire shared locks; purges acquire exclusive locks that block new writes and conflict when lock timeouts expire. Repository mutations across feedback records, webhooks, embeddings, and taxonomy are refactored to use the locking pattern. Async workers handle conflicts by retrying or treating them as temporary unavailability. OpenAPI documentation and a comprehensive integration test suite document and validate the locking behavior.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 53.19% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description is comprehensive and well-structured, following the template with all required sections properly filled out.
Title check	✅ Passed	The title clearly summarizes the main change: implementing serialization of tenant-owned writes against tenant data purges via advisory locks.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/workers/webhook_dispatch.go (1)

165-201: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Only emit disable-success signals after the disable write succeeds.

RecordWebhookDisabled and the "webhook disabled after max delivery attempts" log still run on the new tenant-conflict path, even though Lines 183-188 explicitly skip the update during a purge. That inflates disable metrics and produces a false success log for a webhook that was never disabled.

♻️ Suggested change

 	if isLastAttempt {
 		if w.metrics != nil {
-			w.metrics.RecordWebhookDisabled(ctx, "max_attempts")
 			w.metrics.RecordDelivery(ctx, args.EventType, "failed_final")
 			w.metrics.RecordWebhookDeliveryDuration(ctx, time.Since(start), args.EventType, "failed_final")
 		}
@@
 		switch {
 		case updateErr == nil:
+			if w.metrics != nil {
+				w.metrics.RecordWebhookDisabled(ctx, "max_attempts")
+			}
+
+			slog.Error("webhook disabled after max delivery attempts",
+				"webhook_id", webhook.ID,
+				"event_id", args.EventID,
+				"error", err,
+			)
 		case errors.Is(updateErr, huberrors.ErrTenantWriteConflict):
 			// A tenant data purge is deleting this webhook; disabling it is moot.
 			slog.Warn("webhook dispatch: disable skipped, tenant purge in progress",
 				"webhook_id", webhook.ID,
@@
-		slog.Error("webhook disabled after max delivery attempts",
-			"webhook_id", webhook.ID,
-			"event_id", args.EventID,
-			"error", err,
-		)
-
 		return fmt.Errorf("webhook send (final attempt): %w", err)
 	}

Based on PR objectives, webhook dispatch should swallow disable-conflicts during purges while still reporting delivery failures.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/workers/webhook_dispatch.go` around lines 165 - 201, The code
currently records a disable metric and emits the "webhook disabled after max
delivery attempts" log even when repo.Update(webhook.ID, ...) was skipped due to
a tenant purge conflict; move the calls to w.metrics.RecordWebhookDisabled(...)
and the final slog.Error("webhook disabled after max delivery attempts", ...)
into the successful-update branch (i.e., only when updateErr == nil). Keep
recording delivery failure metrics (w.metrics.RecordDelivery /
RecordWebhookDeliveryDuration) and emitting the failure log/error when updateErr
is a tenant write conflict or other error, but do not signal "disabled" nor call
RecordWebhookDisabled unless the Update call actually succeeded.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/repository/feedback_records_repository.go`:
- Around line 455-486: DeleteByUser currently snapshots tenantIDs via
listUserFeedbackTenants before acquiring locks, allowing new tenant writes to
appear after that snapshot and escape deletion; modify DeleteByUser (involving
withTenantWriteTx, listUserFeedbackTenants, tryLockTenantsShared and the
tenantIDs variable) to detect tenant-set drift and retry or fail: after
acquiring locks (or using a user-scoped lock), re-query the tenant set and
compare to the original tenantIDs—if they differ, either expand the deletion to
include the new tenants or return a retryable error so the caller can retry;
alternatively acquire a global/user-scoped lock before listing tenants so the
tenant set cannot change between listUserFeedbackTenants and the DELETE. Ensure
the chosen approach returns a retryable conflict when drift is detected and that
returned DeletedFeedbackRecordsByTenant accurately reflects what was deleted.

In `@internal/repository/taxonomy_repository.go`:
- Around line 985-1002: The function getNodeForUpdate currently accepts a
generic queryer which allows non-transactional calls; change its third parameter
type from queryer to the tx-only interface (e.g., tenantWriteTx) so the compiler
forces callers to pass an explicit tenant-scoped transaction; update the
function signature and any callers (including RenameNode and RemoveNode) to
open/forward the tenantWriteTx created by the tenant write lock helper, and
adjust imports/types as needed so the row lock acquired by the SELECT ... FOR
UPDATE is held for the life of the transaction.

In `@internal/repository/tenant_data_repository.go`:
- Around line 76-81: The deferred rollback handling must ignore spurious
connection-closed errors when acquireTenantPurgeLock cancels the request ctx:
update the cleanup path that calls dbTx.Rollback(ctx) (the rollback invoked
after calling acquireTenantPurgeLock and anywhere dbTx is rolled back on error)
to either call Rollback with a detached context (e.g., context.Background()) or
to treat pgconn.ErrConnClosed (and pgx.ErrTxClosed) as non-fatal and swallow
them instead of logging them as failures; reference acquireTenantPurgeLock,
dbTx.Rollback(ctx), purgeLockTimeout and tenantID to locate the affected
transaction cleanup code and adjust the error handling to ignore these specific
connection-closed/tx-closed errors.

In `@internal/repository/tenant_write_lock.go`:
- Around line 156-160: The acquireTenantPurgeLock helper must validate the
timeout before calling set_config: reject or normalize non-positive durations
(timeout <= 0) and return an error immediately instead of passing
timeout.Milliseconds() to Postgres; update acquireTenantPurgeLock to check the
timeout parameter and return a clear error for zero/negative values prior to
calling exec.Exec(`SELECT set_config('lock_timeout', $1, true)`, ...), and add
regression tests covering timeout == 0 and negative durations to assert the
function fails fast rather than disabling or passing an invalid lock_timeout.

In `@openapi.yaml`:
- Around line 1068-1076: Split the ambiguous 409 response into two distinct 409
cases (distinguished by the problem detail "code" field) so each has the correct
remediation: one for the purge conflict (code tenant_purge_in_progress) whose
description instructs the client to retry after the purge completes and mentions
a Retry-After header or waiting for completion, and a second for concurrent
modification (code tenant_write_conflict) whose description instructs the client
to re-fetch the resource and retry the operation or surface the conflict to the
user (do not tell clients to wait for purge completion); update both response
entries that reference ErrorModel (the 409 block shown and the similar block at
the second location) so the problem detail "code" values and guidance are
explicit and accurate.

In `@tests/tenant_write_lock_test.go`:
- Around line 365-374: The pg_locks check in the require.Eventually block is too
broad and can match any ungranted advisory lock; narrow it to only observe the
purge for tenantA by filtering the query with
repository.TenantWriteLockKey(tenantA) (or by selecting the purge session PID)
so the count only considers the advisory lock key for
TenantWriteLockKey(tenantA). Update the SQL inside the eventually closure (the
QueryRow call) to include the appropriate condition using
TenantWriteLockKey(tenantA) and adjust the Scan/params accordingly so the
waiting > 0 assertion only reflects tenantA's purge lock.

---

Outside diff comments:
In `@internal/workers/webhook_dispatch.go`:
- Around line 165-201: The code currently records a disable metric and emits the
"webhook disabled after max delivery attempts" log even when
repo.Update(webhook.ID, ...) was skipped due to a tenant purge conflict; move
the calls to w.metrics.RecordWebhookDisabled(...) and the final
slog.Error("webhook disabled after max delivery attempts", ...) into the
successful-update branch (i.e., only when updateErr == nil). Keep recording
delivery failure metrics (w.metrics.RecordDelivery /
RecordWebhookDeliveryDuration) and emitting the failure log/error when updateErr
is a tenant write conflict or other error, but do not signal "disabled" nor call
RecordWebhookDisabled unless the Update call actually succeeded.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f32ab145-0c96-4e94-8fe3-14d791ba970e

📥 Commits

Reviewing files that changed from the base of the PR and between 4d165fb and 2a8c061.

📒 Files selected for processing (24)

.env.example
AGENTS.md
cmd/api/app.go
internal/api/response/errors.go
internal/api/response/problem.go
internal/api/response/response_test.go
internal/config/config.go
internal/config/config_test.go
internal/huberrors/errors.go
internal/repository/embeddings_repository.go
internal/repository/feedback_records_repository.go
internal/repository/taxonomy_repository.go
internal/repository/tenant_data_repository.go
internal/repository/tenant_data_repository_test.go
internal/repository/tenant_write_lock.go
internal/repository/tenant_write_lock_test.go
internal/repository/webhooks_repository.go
internal/workers/feedback_embedding.go
internal/workers/feedback_embedding_test.go
internal/workers/webhook_dispatch.go
internal/workers/webhook_dispatch_test.go
openapi.yaml
tests/integration_test.go
tests/tenant_write_lock_test.go

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

chore: consolidate tenant write lock helpers and interfaces

b923d1f

xernobyl changed the title ~~feat: serialize tenant-owned writes against tenant data purges (ENG-1013)~~ feat: serialize tenant-owned writes against tenant data purges Jun 15, 2026

chore: address PR comments

e777b59

xernobyl requested a review from BhagyaAmarasinghe June 15, 2026 12:25

BhagyaAmarasinghe reviewed Jun 16, 2026

View reviewed changes

Comment thread internal/repository/feedback_records_repository.go

BhagyaAmarasinghe reviewed Jun 16, 2026

View reviewed changes

Comment thread openapi.yaml Outdated

chore: address pr issues

e15124e

xernobyl closed this Jun 16, 2026

xernobyl deleted the tiago/eng-1013-serialize-tenant-owned-writes-during-hub-tenant-data-purges branch June 16, 2026 09:30

xernobyl restored the tiago/eng-1013-serialize-tenant-owned-writes-during-hub-tenant-data-purges branch June 16, 2026 09:33

xernobyl reopened this Jun 16, 2026

xernobyl requested a review from BhagyaAmarasinghe June 16, 2026 09:38

BhagyaAmarasinghe approved these changes Jun 16, 2026

View reviewed changes

BhagyaAmarasinghe added this pull request to the merge queue Jun 16, 2026

Merged via the queue into main with commit 4141d58 Jun 16, 2026
22 of 23 checks passed

BhagyaAmarasinghe deleted the tiago/eng-1013-serialize-tenant-owned-writes-during-hub-tenant-data-purges branch June 16, 2026 11:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: serialize tenant-owned writes against tenant data purges#87

feat: serialize tenant-owned writes against tenant data purges#87
BhagyaAmarasinghe merged 4 commits into
mainfrom
tiago/eng-1013-serialize-tenant-owned-writes-during-hub-tenant-data-purges

xernobyl commented Jun 12, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

xernobyl commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

How should this be tested?

Checklist

Required

Appreciated

Uh oh!

github-actions Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✱ Stainless preview builds

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xernobyl commented Jun 12, 2026 •

edited

Loading

github-actions Bot commented Jun 12, 2026 •

edited

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading