HDDS-15444. Adjusted Ratis client retry, timeout configs to improve failure responsiveness #10482

Merged
sumitagrawl merged 2 commits into apache:master from ptlrs:HDDS-15444-Client-configuration-retry
Jun 12, 2026
@ptlrs ptlrs commented Jun 10, 2026

What changes were proposed in this pull request?

Some of the Ratis client's timeout and retry configuration values are so large that the client appears unresponsive when a server fails.

This JIRA updates those configurations so that the client responds faster to server-side failures:

  • Exponential Backoff (for TimeoutIOException on writes):

    • Base sleep: 4s to 1s — faster initial retry
    • Max sleep cap: 40s to 5s — don't wait long between retries on a dead leader
    • Max retries: unlimited (Integer.MAX_VALUE) to 2 — if 2 retries fail, the leader is dead. We will let Ozone allocate a new pipeline
  • Multilinear Random Retry (for generic/other exceptions):

    • Policy: 5s×5, 10s×5, 15s×5, 20s×5, 25s×5, 60s×10 (~16 min, 35 retries) to 5s×6 (~30s total) — fail fast instead of hanging
  • Watch Timeout (waiting for ALL_COMMITTED replication):

    • Overall watch timeout: 3 min to 30s
    • Watch RPC timeout: 180s to 30s — aligned with server-side watch timeout (also 30s). There's no point waiting longer than the server
  • Write Timeout (overall budget for write retries):

    • Write request timeout: 5 min to 70s — enough for one RPC (60s) + buffer. This prevents retries from dragging on
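
To make the multilinear numbers above easier to check, here is a minimal sketch that sums the nominal sleep time and retry count of the old and new policies. The class and method names are hypothetical and exist only for this illustration; the sketch ignores the random jitter that the policy applies and does not call any Ratis API.

```java
// Hypothetical helper, not part of Ozone or Ratis: sums nominal sleeps of a
// multilinear retry policy given as {sleepSeconds, retryCount} pairs.
final class MultilinearRetryBudget {
    // Old policy from the description: 5s x5, 10s x5, 15s x5, 20s x5, 25s x5, 60s x10.
    static final int[][] OLD_POLICY = {
        {5, 5}, {10, 5}, {15, 5}, {20, 5}, {25, 5}, {60, 10}
    };
    // New policy from the description: 5s x6.
    static final int[][] NEW_POLICY = {
        {5, 6}
    };

    /** Total number of retries across all {sleep, count} pairs. */
    static int totalRetries(int[][] policy) {
        int retries = 0;
        for (int[] step : policy) {
            retries += step[1];
        }
        return retries;
    }

    /** Nominal total sleep in seconds, ignoring randomization. */
    static int nominalSleepSeconds(int[][] policy) {
        int seconds = 0;
        for (int[] step : policy) {
            seconds += step[0] * step[1];
        }
        return seconds;
    }

    public static void main(String[] args) {
        System.out.println("old: " + totalRetries(OLD_POLICY) + " retries, "
            + nominalSleepSeconds(OLD_POLICY) + " s");
        System.out.println("new: " + totalRetries(NEW_POLICY) + " retries, "
            + nominalSleepSeconds(NEW_POLICY) + " s");
    }
}
```

Under these assumptions the old policy comes to 35 retries and 975 s of nominal sleep (about 16 minutes), and the new one to 6 retries and 30 s, which matches the figures quoted in the list.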

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-15444

How was this patch tested?

CI: https://github.com/ptlrs/ozone/actions/runs/27291093969

@yandrey321 yandrey321 left a comment

lgtm

@sumitagrawl sumitagrawl left a comment

LGTM

@sumitagrawl sumitagrawl marked this pull request as ready for review June 11, 2026 03:45
@sumitagrawl sumitagrawl marked this pull request as draft June 11, 2026 03:46
  tags = { OZONE, CLIENT, PERFORMANCE },
  description = "Client's max retry value for the exponential backoff policy.")
- private int exponentialPolicyMaxRetries = Integer.MAX_VALUE;
+ private int exponentialPolicyMaxRetries = 2;

@amaliujia amaliujia Jun 11, 2026

So, with all the changes combined, the theoretical numbers for write requests are:

  1. The max total retry time window would be 2 * max_sleep * 1.5 = 2 * 5 * 1.5 = 15 seconds.
  2. The min total retry time window would be 1 * 2 * 0.5 + 1 * 4 * 0.5 = 3 seconds.

Just curious whether a 3 to 15 second window is enough to decide that the leader is unreachable after the initial timeout?

@amaliujia amaliujia Jun 11, 2026

Oh sorry, I might have confused myself:

I guess the retry settings here only control when to retry, and there is a separate request timeout. So the updated numbers are:

  1. The max total retry time window would be 2 * max_sleep * 1.5 + 2 * write_request_time_out = 2 * 5 * 1.5 + 2 * 70 = 155 seconds.
  2. The min total retry time window would be 1 * 2 * 0.5 + 1 * 4 * 0.5 + 2 * write_request_time_out = 143 seconds.

After the initial timeout, there are around 143 to 155 seconds before giving up, which seems to be sufficient.
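
The estimate in this comment can be reproduced with a short sketch. It mirrors the reviewer's own formulas (sleep doubling per retry, a jitter factor between 0.5 and 1.5, and one write request timeout per retry); those formulas and all names below are assumptions for illustration, not the actual Ratis backoff implementation.

```java
// Hypothetical helper that re-computes the reviewer's back-of-envelope bounds.
final class RetryWindowEstimate {
    // Jitter range assumed in the review comment.
    static final double MIN_JITTER = 0.5;
    static final double MAX_JITTER = 1.5;

    /** Lower bound on sleep: retry i (1-based) sleeps base * 2^i at minimum jitter. */
    static double minSleepSeconds(double baseSleepSeconds, int retries) {
        double total = 0;
        for (int i = 1; i <= retries; i++) {
            total += baseSleepSeconds * Math.pow(2, i) * MIN_JITTER;
        }
        return total;
    }

    /** Upper bound on sleep: every retry sleeps the capped maximum at maximum jitter. */
    static double maxSleepSeconds(double maxSleepSeconds, int retries) {
        return retries * maxSleepSeconds * MAX_JITTER;
    }

    /** Adds one full write request timeout per retry to a sleep total. */
    static double withRequestTimeouts(double sleepSeconds, double writeTimeoutSeconds, int retries) {
        return sleepSeconds + retries * writeTimeoutSeconds;
    }

    public static void main(String[] args) {
        int retries = 2;            // new max retries
        double baseSleep = 1;       // new base sleep, seconds
        double maxSleep = 5;        // new max sleep cap, seconds
        double writeTimeout = 70;   // new write request timeout, seconds

        double min = withRequestTimeouts(minSleepSeconds(baseSleep, retries), writeTimeout, retries);
        double max = withRequestTimeouts(maxSleepSeconds(maxSleep, retries), writeTimeout, retries);
        System.out.println("min window: " + min + " s, max window: " + max + " s");
    }
}
```

With the new defaults plugged in, the sleep-only bounds are 3 s and 15 s, and adding two 70 s request timeouts gives the 143 s and 155 s quoted above.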

@sumitagrawl sumitagrawl marked this pull request as ready for review June 12, 2026 04:38
@sumitagrawl sumitagrawl merged commit 42eedc2 into apache:master Jun 12, 2026
59 of 61 checks passed