Skip to content

feat: serialize tenant-owned writes against tenant data purges#87

Merged
BhagyaAmarasinghe merged 4 commits into
mainfrom
tiago/eng-1013-serialize-tenant-owned-writes-during-hub-tenant-data-purges
Jun 16, 2026
Merged

feat: serialize tenant-owned writes against tenant data purges#87
BhagyaAmarasinghe merged 4 commits into
mainfrom
tiago/eng-1013-serialize-tenant-owned-writes-during-hub-tenant-data-purges

Conversation

@xernobyl

@xernobyl xernobyl commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Adds a deliberate tenant-write serialization layer so the tenant data purge
(DELETE /v1/tenants/{tenant_id}/data) can no longer race concurrent
tenant-owned writes for the same tenant. Previously the purge deleted in one
transaction but did not serialize against in-flight writes, so a concurrent
create/update could commit mid-purge and survive it.

Closes ENG-1013

Mechanism — Postgres transaction-scoped advisory locks, single source of
truth in internal/repository/tenant_write_lock.go:

  • Every tenant-owned mutation (feedback records, webhooks, embeddings, taxonomy)
    takes a shared lock on hashtextextended('tenant_write|<len>:<tenant_id>', 0).
    Shared mode keeps same-tenant writers fully concurrent.
  • The purge takes the same key exclusively, waiting up to
    TENANT_PURGE_LOCK_TIMEOUT_SECONDS (default 5) for in-flight writers to drain.
  • Writers use try-locks: the moment a purge is queued, new same-tenant writes
    fail fast; other tenants are unaffected. Both directions return a retryable
    409 tenant_write_conflict (new RFC 9457 problem code, distinct from the
    terminal duplicate-record conflict).
  • ID-based mutations resolve the row's tenant inside the transaction and mutate
    with a tenant-scoped WHERE; tenant-changing webhook updates lock both old and
    new tenant; the GDPR cross-tenant delete locks every spanned tenant.
  • Creates (the hot path) gate the insert on the lock in a single autocommit
    statement (INSERT ... SELECT ... WHERE pg_try_advisory_xact_lock_shared(...)),
    so they serialize in one round trip with the same isolation.

No schema migration (advisory locks are not schema). Since the issue was filed,
taxonomy added six tenant-owned tables; those write paths are covered too.

Scope grew to cover review feedback: deduped the resolve+lock sequence and the
tx/interface helpers, and floored the purge lock_timeout to ≥1ms so it can
never disable the timeout.

API behavior change — affected endpoints now document a retryable 409. Example body:

{
  "type": "https://hub.formbricks.com/problems/tenant-write-conflict",
  "title": "Conflict",
  "status": 409,
  "detail": "tenant data purge in progress for this tenant; retry later",
  "instance": "/v1/feedback-records",
  "code": "tenant_write_conflict",
  "request_id": "019ebcb4-c817-7582-8017-cbc52258e368"
}

Accepted, documented tails: webhook deliveries already in flight at purge commit
complete; already-enqueued dispatch jobs no-op post-purge but their payloads
persist until River retention pruning.

How should this be tested?

Config: integration tests need a Postgres+pgvector test_db (the API serves on
PORT, tests use DATABASE_URL, API_KEY).

make dev-setup           # or: docker compose up -d postgres && make init-db && make river-migrate
make build
make fmt && make lint    # 0 issues
make test-unit           # no DB
make tests               # integration, real Postgres (CI repeats on pg16/17/18)
make tests-coverage      # 54.8% (threshold 15%)

ENG-1013 integration coverage in tests/tenant_write_lock_test.go — each holds
the advisory lock directly on a raw connection to make the race deterministic:

  • TestTenantPurgeWaitsForInFlightWritesThenConflicts — purge waits then 409s; succeeds after release
  • TestTenantWritesRejectedDuringPurge — every same-tenant write → 409 (incl. gated creates); other tenants → 201; no events published for rejected writes
  • TestTenantWritesFailFastWhilePurgeQueuedpg_locks-verified: new write rejected while purge is queued; queued purge completes after writers drain
  • TestGDPRDeleteByUserDuringPurge — all-or-nothing 409 across tenants; tenant-scoped delete proceeds
  • TestWebhookTenantMoveLocksTargetTenant, TestRepositoryWritesConflictDuringPurge

API contract: build, run, and make schemathesis (or the API-contract workflow)
status_code_conformance / response_schema_conformance pass with the new 409s
and tenant_write_conflict enum member.

Checklist

Required

  • Filled out the "How to test" section in this PR
  • Read [Repository Guidelines](https://github.com/formbricks/hub/blob/main/AGENTS.md)
  • Self-reviewed my own code
  • Commented on my code in hard-to-understand bits
  • Ran make build
  • Ran make tests (integration tests in tests/)
  • Ran make fmt and make lint; no new warnings
  • Removed debug prints / temporary logging
  • Merged the latest changes from main onto my branch with git pull origin main
  • If database schema changed: added migration — N/A, advisory locks are not schema

Appreciated

  • If API changed: added/updated OpenAPI spec and ran contract tests
  • If API behavior changed: added request/response examples to this PR
  • Updated docs in docs/ if changes were necessary — N/A (endpoint docs are in openapi.yaml)
  • Ran make tests-coverage for meaningful logic changes

Tenant data purges now take an exclusive transaction-scoped advisory lock
per tenant while every tenant-owned mutation (feedback records, webhooks,
embeddings, taxonomy) holds the same key in shared try-lock mode inside a
single repository-level helper. Writes for a tenant under purge fail fast
with a retryable 409 (code tenant_write_conflict); the purge waits up to
TENANT_PURGE_LOCK_TIMEOUT_SECONDS (default 5s) for in-flight writes to
drain and then returns the same retryable conflict. Other tenants are
unaffected. ID-based mutations now resolve the row's tenant inside the
transaction and mutate with tenant-scoped WHERE clauses; the GDPR
delete-by-user locks every spanned tenant; tenant-changing webhook
updates lock both tenants. Workers skip or retry cleanly on conflicts.

Closes ENG-1013

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown

✱ Stainless preview builds

This PR will update the hub SDKs with the following commit message.

feat: serialize tenant-owned writes against tenant data purges (ENG-1013)
hub-openapi studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅

hub-typescript studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅build ✅lint ✅test ✅


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-06-16 11:25:37 UTC

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

This PR introduces tenant-level write serialization using Postgres transaction-scoped advisory locks to prevent race conditions between tenant data purges and concurrent mutations. A new TenantWriteConflictError type maps to HTTP 409 Conflict responses. The configuration system parses TENANT_PURGE_LOCK_TIMEOUT_SECONDS (defaulting to 5 seconds). A core withTenantWriteTx helper wraps mutations in tenant-scoped transactions that acquire shared locks; purges acquire exclusive locks that block new writes and conflict when lock timeouts expire. Repository mutations across feedback records, webhooks, embeddings, and taxonomy are refactored to use the locking pattern. Async workers handle conflicts by retrying or treating them as temporary unavailability. OpenAPI documentation and a comprehensive integration test suite document and validate the locking behavior.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.19% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is comprehensive and well-structured, following the template with all required sections properly filled out.
Title check ✅ Passed The title clearly summarizes the main change: implementing serialization of tenant-owned writes against tenant data purges via advisory locks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/workers/webhook_dispatch.go (1)

165-201: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Only emit disable-success signals after the disable write succeeds.

RecordWebhookDisabled and the "webhook disabled after max delivery attempts" log still run on the new tenant-conflict path, even though Lines 183-188 explicitly skip the update during a purge. That inflates disable metrics and produces a false success log for a webhook that was never disabled.

♻️ Suggested change
 	if isLastAttempt {
 		if w.metrics != nil {
-			w.metrics.RecordWebhookDisabled(ctx, "max_attempts")
 			w.metrics.RecordDelivery(ctx, args.EventType, "failed_final")
 			w.metrics.RecordWebhookDeliveryDuration(ctx, time.Since(start), args.EventType, "failed_final")
 		}
@@
 		switch {
 		case updateErr == nil:
+			if w.metrics != nil {
+				w.metrics.RecordWebhookDisabled(ctx, "max_attempts")
+			}
+
+			slog.Error("webhook disabled after max delivery attempts",
+				"webhook_id", webhook.ID,
+				"event_id", args.EventID,
+				"error", err,
+			)
 		case errors.Is(updateErr, huberrors.ErrTenantWriteConflict):
 			// A tenant data purge is deleting this webhook; disabling it is moot.
 			slog.Warn("webhook dispatch: disable skipped, tenant purge in progress",
 				"webhook_id", webhook.ID,
@@
-		slog.Error("webhook disabled after max delivery attempts",
-			"webhook_id", webhook.ID,
-			"event_id", args.EventID,
-			"error", err,
-		)
-
 		return fmt.Errorf("webhook send (final attempt): %w", err)
 	}

Based on PR objectives, webhook dispatch should swallow disable-conflicts during purges while still reporting delivery failures.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/workers/webhook_dispatch.go` around lines 165 - 201, The code
currently records a disable metric and emits the "webhook disabled after max
delivery attempts" log even when repo.Update(webhook.ID, ...) was skipped due to
a tenant purge conflict; move the calls to w.metrics.RecordWebhookDisabled(...)
and the final slog.Error("webhook disabled after max delivery attempts", ...)
into the successful-update branch (i.e., only when updateErr == nil). Keep
recording delivery failure metrics (w.metrics.RecordDelivery /
RecordWebhookDeliveryDuration) and emitting the failure log/error when updateErr
is a tenant write conflict or other error, but do not signal "disabled" nor call
RecordWebhookDisabled unless the Update call actually succeeded.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/repository/feedback_records_repository.go`:
- Around line 455-486: DeleteByUser currently snapshots tenantIDs via
listUserFeedbackTenants before acquiring locks, allowing new tenant writes to
appear after that snapshot and escape deletion; modify DeleteByUser (involving
withTenantWriteTx, listUserFeedbackTenants, tryLockTenantsShared and the
tenantIDs variable) to detect tenant-set drift and retry or fail: after
acquiring locks (or using a user-scoped lock), re-query the tenant set and
compare to the original tenantIDs—if they differ, either expand the deletion to
include the new tenants or return a retryable error so the caller can retry;
alternatively acquire a global/user-scoped lock before listing tenants so the
tenant set cannot change between listUserFeedbackTenants and the DELETE. Ensure
the chosen approach returns a retryable conflict when drift is detected and that
returned DeletedFeedbackRecordsByTenant accurately reflects what was deleted.

In `@internal/repository/taxonomy_repository.go`:
- Around line 985-1002: The function getNodeForUpdate currently accepts a
generic queryer which allows non-transactional calls; change its third parameter
type from queryer to the tx-only interface (e.g., tenantWriteTx) so the compiler
forces callers to pass an explicit tenant-scoped transaction; update the
function signature and any callers (including RenameNode and RemoveNode) to
open/forward the tenantWriteTx created by the tenant write lock helper, and
adjust imports/types as needed so the row lock acquired by the SELECT ... FOR
UPDATE is held for the life of the transaction.

In `@internal/repository/tenant_data_repository.go`:
- Around line 76-81: The deferred rollback handling must ignore spurious
connection-closed errors when acquireTenantPurgeLock cancels the request ctx:
update the cleanup path that calls dbTx.Rollback(ctx) (the rollback invoked
after calling acquireTenantPurgeLock and anywhere dbTx is rolled back on error)
to either call Rollback with a detached context (e.g., context.Background()) or
to treat pgconn.ErrConnClosed (and pgx.ErrTxClosed) as non-fatal and swallow
them instead of logging them as failures; reference acquireTenantPurgeLock,
dbTx.Rollback(ctx), purgeLockTimeout and tenantID to locate the affected
transaction cleanup code and adjust the error handling to ignore these specific
connection-closed/tx-closed errors.

In `@internal/repository/tenant_write_lock.go`:
- Around line 156-160: The acquireTenantPurgeLock helper must validate the
timeout before calling set_config: reject or normalize non-positive durations
(timeout <= 0) and return an error immediately instead of passing
timeout.Milliseconds() to Postgres; update acquireTenantPurgeLock to check the
timeout parameter and return a clear error for zero/negative values prior to
calling exec.Exec(`SELECT set_config('lock_timeout', $1, true)`, ...), and add
regression tests covering timeout == 0 and negative durations to assert the
function fails fast rather than disabling or passing an invalid lock_timeout.

In `@openapi.yaml`:
- Around line 1068-1076: Split the ambiguous 409 response into two distinct 409
cases (distinguished by the problem detail "code" field) so each has the correct
remediation: one for the purge conflict (code tenant_purge_in_progress) whose
description instructs the client to retry after the purge completes and mentions
a Retry-After header or waiting for completion, and a second for concurrent
modification (code tenant_write_conflict) whose description instructs the client
to re-fetch the resource and retry the operation or surface the conflict to the
user (do not tell clients to wait for purge completion); update both response
entries that reference ErrorModel (the 409 block shown and the similar block at
the second location) so the problem detail "code" values and guidance are
explicit and accurate.

In `@tests/tenant_write_lock_test.go`:
- Around line 365-374: The pg_locks check in the require.Eventually block is too
broad and can match any ungranted advisory lock; narrow it to only observe the
purge for tenantA by filtering the query with
repository.TenantWriteLockKey(tenantA) (or by selecting the purge session PID)
so the count only considers the advisory lock key for
TenantWriteLockKey(tenantA). Update the SQL inside the eventually closure (the
QueryRow call) to include the appropriate condition using
TenantWriteLockKey(tenantA) and adjust the Scan/params accordingly so the
waiting > 0 assertion only reflects tenantA's purge lock.

---

Outside diff comments:
In `@internal/workers/webhook_dispatch.go`:
- Around line 165-201: The code currently records a disable metric and emits the
"webhook disabled after max delivery attempts" log even when
repo.Update(webhook.ID, ...) was skipped due to a tenant purge conflict; move
the calls to w.metrics.RecordWebhookDisabled(...) and the final
slog.Error("webhook disabled after max delivery attempts", ...) into the
successful-update branch (i.e., only when updateErr == nil). Keep recording
delivery failure metrics (w.metrics.RecordDelivery /
RecordWebhookDeliveryDuration) and emitting the failure log/error when updateErr
is a tenant write conflict or other error, but do not signal "disabled" nor call
RecordWebhookDisabled unless the Update call actually succeeded.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f32ab145-0c96-4e94-8fe3-14d791ba970e

📥 Commits

Reviewing files that changed from the base of the PR and between 4d165fb and 2a8c061.

📒 Files selected for processing (24)
  • .env.example
  • AGENTS.md
  • cmd/api/app.go
  • internal/api/response/errors.go
  • internal/api/response/problem.go
  • internal/api/response/response_test.go
  • internal/config/config.go
  • internal/config/config_test.go
  • internal/huberrors/errors.go
  • internal/repository/embeddings_repository.go
  • internal/repository/feedback_records_repository.go
  • internal/repository/taxonomy_repository.go
  • internal/repository/tenant_data_repository.go
  • internal/repository/tenant_data_repository_test.go
  • internal/repository/tenant_write_lock.go
  • internal/repository/tenant_write_lock_test.go
  • internal/repository/webhooks_repository.go
  • internal/workers/feedback_embedding.go
  • internal/workers/feedback_embedding_test.go
  • internal/workers/webhook_dispatch.go
  • internal/workers/webhook_dispatch_test.go
  • openapi.yaml
  • tests/integration_test.go
  • tests/tenant_write_lock_test.go

Comment thread internal/repository/feedback_records_repository.go
Comment thread internal/repository/taxonomy_repository.go
Comment thread internal/repository/tenant_data_repository.go
Comment thread internal/repository/tenant_write_lock.go Outdated
Comment thread openapi.yaml
Comment thread tests/tenant_write_lock_test.go Outdated
@xernobyl xernobyl changed the title feat: serialize tenant-owned writes against tenant data purges (ENG-1013) feat: serialize tenant-owned writes against tenant data purges Jun 15, 2026
Comment thread internal/repository/feedback_records_repository.go
Comment thread openapi.yaml Outdated
@xernobyl xernobyl closed this Jun 16, 2026
@xernobyl xernobyl deleted the tiago/eng-1013-serialize-tenant-owned-writes-during-hub-tenant-data-purges branch June 16, 2026 09:30
@xernobyl xernobyl restored the tiago/eng-1013-serialize-tenant-owned-writes-during-hub-tenant-data-purges branch June 16, 2026 09:33
@xernobyl xernobyl reopened this Jun 16, 2026
@BhagyaAmarasinghe BhagyaAmarasinghe added this pull request to the merge queue Jun 16, 2026
Merged via the queue into main with commit 4141d58 Jun 16, 2026
22 of 23 checks passed
@BhagyaAmarasinghe BhagyaAmarasinghe deleted the tiago/eng-1013-serialize-tenant-owned-writes-during-hub-tenant-data-purges branch June 16, 2026 11:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants