HDDS-15444. Adjusted Ratis client retry, timeout configs to improve failure responsiveness #10482
Merged
sumitagrawl merged 2 commits on Jun 12, 2026
amaliujia reviewed on Jun 11, 2026
      tags = { OZONE, CLIENT, PERFORMANCE },
      description = "Client's max retry value for the exponential backoff policy.")
-  private int exponentialPolicyMaxRetries = Integer.MAX_VALUE;
+  private int exponentialPolicyMaxRetries = 2;
Contributor
So with all the changes combined, in theory, for write requests:
- max total retry time window would be 2 * max_sleep * 1.5 = 2 * 5 * 1.5 = 15 seconds.
- min total retry time window would be 1 * 2 * 0.5 + 1 * 4 * 0.5 = 3 seconds.
Just curious: is a 3 to 15 second window enough to decide that the leader is not reachable after the initial timeout?
Contributor
Oh sorry, I might have confused myself:
I guess the retry settings here control when to retry, and there is a separate request timeout on top of that. So the updated numbers are:
- max total retry time window would be 2 * max_sleep * 1.5 + 2 * write_request_time_out = 2 * 5 * 1.5 + 2 * 70 = 155 seconds.
- min total retry time window would be 1 * 2 * 0.5 + 1 * 4 * 0.5 + 2 * write_request_time_out = 143 seconds.
After the initial timeout, there are around 143 to 155 seconds before giving up, which seems to be sufficient.
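For anyone who wants to re-run the estimate above with different values, here is a minimal Java sketch of the same arithmetic. It is only a reproduction of the numbers in this comment thread, not the actual Ratis retry policy code: the class name, the constants (2 retries, 2 s and 4 s backoff sleeps, a 5 s max sleep, a 0.5 to 1.5 jitter range, a 70 s write request timeout) and the assumption that every retry can wait for a full request timeout are all taken from, or assumed on top of, the comments above.

```java
// Hypothetical helper that reproduces the reviewer's back-of-the-envelope
// numbers. It does not read any Ozone or Ratis configuration.
class RetryWindowEstimate {
  // All values below are assumptions copied from the review comments.
  static final int MAX_RETRIES = 2;
  static final double[] BACKOFF_SLEEPS_SECONDS = {2, 4};
  static final double MAX_SLEEP_SECONDS = 5;
  static final double MIN_JITTER = 0.5;
  static final double MAX_JITTER = 1.5;
  static final double WRITE_REQUEST_TIMEOUT_SECONDS = 70;

  // Lower bound on time spent sleeping: each backoff sleep at minimum jitter.
  static double minSleepSeconds() {
    double total = 0;
    for (double sleep : BACKOFF_SLEEPS_SECONDS) {
      total += sleep * MIN_JITTER;
    }
    return total;
  }

  // Loose upper bound on time spent sleeping: every retry sleeps for the
  // max sleep at maximum jitter, as in the comment's first formula.
  static double maxSleepSeconds() {
    return MAX_RETRIES * MAX_SLEEP_SECONDS * MAX_JITTER;
  }

  // Time spent waiting on requests, assuming each retry runs into the
  // full write request timeout.
  static double requestTimeoutSeconds() {
    return MAX_RETRIES * WRITE_REQUEST_TIMEOUT_SECONDS;
  }

  static double minTotalSeconds() {
    return minSleepSeconds() + requestTimeoutSeconds();
  }

  static double maxTotalSeconds() {
    return maxSleepSeconds() + requestTimeoutSeconds();
  }

  public static void main(String[] args) {
    System.out.println("sleep only: " + minSleepSeconds() + " to " + maxSleepSeconds() + " s");
    System.out.println("with request timeouts: " + minTotalSeconds() + " to " + maxTotalSeconds() + " s");
  }
}
```

Changing `MAX_RETRIES` or `WRITE_REQUEST_TIMEOUT_SECONDS` shows how strongly the request timeout dominates the window compared with the backoff sleeps.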
What changes were proposed in this pull request?
Some of the Ratis client configuration values are so large that the client appears unresponsive when the server side fails.
This Jira updates some of these configurations to make the client more responsive to server-side failures.
Exponential Backoff (for TimeoutIOException on writes):
Multilinear Random Retry (for generic/other exceptions):
Watch Timeout (waiting for ALL_COMMITTED replication):
Write Timeout (overall budget for write retries):
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-15444
How was this patch tested?
CI: https://github.com/ptlrs/ozone/actions/runs/27291093969