feat: add RateLimitLayer middleware #567

Open

Linuxdazhao wants to merge 3 commits into 64bit:main from Linuxdazhao:feat/rate-limit-layer

@Linuxdazhao

Contributor

Following up on the discussion in #320, I've implemented the RPM rate-limiting layer.

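It sits below the retry layer. governor handles RPM limiting, and the layer also reads the server's rate-limit state from the x-ratelimit-remaining-requests and x-ratelimit-reset-requests headers. When remaining reaches 0, poll_ready stays pending until the reset time has passed, so no requests are sent that the server would only reject. The governor bucket and the backpressure state are shared across clones, so retries go through the same bucket.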
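A minimal, std-only sketch of the header parsing described above. It assumes x-ratelimit-remaining-requests carries a plain integer and x-ratelimit-reset-requests a Go-style duration string such as `1s` or `6m0s` (the format shown in OpenAI's rate-limit documentation). `parse_remaining` and `parse_reset` are hypothetical names, not the functions in this PR.

```rust
use std::time::Duration;

/// Parse `x-ratelimit-remaining-requests`, e.g. "59" -> Some(59).
pub fn parse_remaining(value: &str) -> Option<u32> {
    value.trim().parse().ok()
}

/// Parse a Go-style duration such as "1s", "6m0s", "20ms" or "1.5s"
/// (assumed format of `x-ratelimit-reset-requests`). Unknown units yield None.
pub fn parse_reset(value: &str) -> Option<Duration> {
    let mut rest = value.trim();
    if rest.is_empty() {
        return None;
    }
    let mut total_nanos: u64 = 0;
    while !rest.is_empty() {
        // Numeric prefix: digits plus an optional decimal point.
        let num_len = rest
            .find(|c: char| !(c.is_ascii_digit() || c == '.'))
            .unwrap_or(rest.len());
        let number: f64 = rest[..num_len].parse().ok()?;
        rest = &rest[num_len..];
        // Check "ms" before "m" so milliseconds are not read as minutes.
        let (nanos_per_unit, unit_len): (u64, usize) = if rest.starts_with("ms") {
            (1_000_000, 2)
        } else if rest.starts_with('h') {
            (3_600_000_000_000, 1)
        } else if rest.starts_with('m') {
            (60_000_000_000, 1)
        } else if rest.starts_with('s') {
            (1_000_000_000, 1)
        } else {
            return None;
        };
        // Round each component to whole nanoseconds to avoid float drift.
        let part = (number * nanos_per_unit as f64).round() as u64;
        total_nanos = total_nanos.checked_add(part)?;
        rest = &rest[unit_len..];
    }
    Some(Duration::from_nanos(total_nanos))
}
```

Returning `None` for an unrecognised value lets the caller skip server backpressure for that response instead of failing the request.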

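On native targets, waiting uses tokio::time::sleep, and the sleep future is stored in the service so the registered waker is not lost; on wasm, only governor's local counting applies, with no delay.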
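A std-only sketch of why the pending sleep has to be stored in the service rather than recreated on each poll. `Delay` is a hypothetical stand-in for `tokio::time::Sleep` built on a helper thread; its `Drop` impl clears the registered waker to mimic how dropping a tokio `Sleep` cancels its timer. `BackpressureGate` and `wait_ready` are likewise hypothetical, not this PR's types.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread;
use std::time::Instant;

/// Hypothetical stand-in for tokio::time::Sleep: a helper thread calls the
/// most recently registered waker once the deadline has passed.
pub struct Delay {
    deadline: Instant,
    waker: Arc<Mutex<Option<Waker>>>,
    timer_started: bool,
}

impl Delay {
    pub fn until(deadline: Instant) -> Self {
        Delay { deadline, waker: Arc::new(Mutex::new(None)), timer_started: false }
    }
}

impl Future for Delay {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if Instant::now() >= self.deadline {
            return Poll::Ready(());
        }
        // Register (or refresh) the waker the timer thread will call.
        *self.waker.lock().unwrap() = Some(cx.waker().clone());
        if !self.timer_started {
            self.timer_started = true;
            let (deadline, slot) = (self.deadline, Arc::clone(&self.waker));
            thread::spawn(move || {
                thread::sleep(deadline.saturating_duration_since(Instant::now()));
                if let Some(waker) = slot.lock().unwrap().take() {
                    waker.wake();
                }
            });
        }
        Poll::Pending
    }
}

impl Drop for Delay {
    // Mimic tokio: dropping the sleep cancels it, so its waker is never called.
    fn drop(&mut self) {
        self.waker.lock().unwrap().take();
    }
}

/// Hypothetical readiness gate: the pending Delay lives in a field so it
/// survives between poll_ready calls.
pub struct BackpressureGate {
    blocked_until: Option<Instant>,
    sleep: Option<Pin<Box<Delay>>>,
}

impl BackpressureGate {
    pub fn new(blocked_until: Option<Instant>) -> Self {
        BackpressureGate { blocked_until, sleep: None }
    }

    pub fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        if let Some(deadline) = self.blocked_until {
            // Reuse the stored Delay; a fresh one dropped at the end of this
            // call would take our waker with it.
            let sleep = self.sleep.get_or_insert_with(|| Box::pin(Delay::until(deadline)));
            if sleep.as_mut().poll(cx).is_pending() {
                return Poll::Pending;
            }
            self.sleep = None;
            self.blocked_until = None;
        }
        Poll::Ready(())
    }
}

struct ThreadWaker(thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Block the current thread until the gate is ready (a tiny executor).
pub fn wait_ready(gate: &mut BackpressureGate) {
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    while gate.poll_ready(&mut cx).is_pending() {
        thread::park();
    }
}
```

If `poll_ready` instead built a new `Delay`, polled it and let it drop at the end of the call, the waker would be cleared and a task waiting on that `Pending` would never be woken.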

TPM limiting requires extracting token usage from the response body, so it will be done separately.

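The layer only takes effect when the rate-limit feature is enabled, so existing users are unaffected. Thirteen unit tests cover header parsing, backpressure state, waker registration, governor bucket sharing, and retries passing through the rate-limit layer.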

Linuxdazhao reopened this Jun 30, 2026
Uses governor for RPM limiting, placed below the retry layer so retries
are also throttled. Reads x-ratelimit-remaining-requests and
x-ratelimit-reset-requests headers to apply backpressure when the server
quota is exhausted. Gated behind the rate-limit feature.
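Placing the limiter below retry matters because a retry layer re-issues each attempt through (a clone of) the inner service; when the limiter state sits behind an `Arc`, every clone, and therefore every retry attempt, draws from the same budget. The sketch below shows this with a simplified, std-only GCRA, the algorithm governor is built on. `SharedGcra` and `check` are hypothetical and are not governor's API.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Hypothetical, simplified GCRA limiter (the algorithm governor implements).
/// `now` is an offset from an arbitrary start so the logic is deterministic.
#[derive(Clone)]
pub struct SharedGcra {
    emission_interval: Duration, // time cost of one request
    burst_tolerance: Duration,   // how far the TAT may run ahead of `now`
    tat: Arc<Mutex<Duration>>,   // theoretical arrival time, shared by clones
}

impl SharedGcra {
    /// Allow `max_burst` requests per `period` (max_burst must be non-zero).
    pub fn new(period: Duration, max_burst: u32) -> Self {
        assert!(max_burst > 0, "max_burst must be non-zero");
        let emission_interval = period / max_burst;
        SharedGcra {
            emission_interval,
            burst_tolerance: emission_interval * (max_burst - 1),
            tat: Arc::new(Mutex::new(Duration::ZERO)),
        }
    }

    /// Ok(()) consumes one cell; Err(wait) says how long until one is free.
    pub fn check(&self, now: Duration) -> Result<(), Duration> {
        let mut tat = self.tat.lock().unwrap();
        let start = (*tat).max(now);
        let limit = now + self.burst_tolerance;
        if start > limit {
            return Err(start - limit);
        }
        *tat = start + self.emission_interval;
        Ok(())
    }
}
```

With 3 requests per 60 s, three immediate calls succeed even when split between the original and a clone, and the fourth is told to wait 20 s.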
Linuxdazhao force-pushed the feat/rate-limit-layer branch from 462e867 to bbdbfef on June 30, 2026 03:47
Linuxdazhao added 2 commits June 30, 2026 14:01
The ServerBackpressure struct, its methods, and helper functions are
only used by the non-wasm Service impl. On wasm32 they became dead code,
failing the build under -D warnings. Gate them with
#[cfg(not(target_family = "wasm"))] and add a wasm-only empty ZST
variant so the field type still compiles.
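A minimal sketch of that gating pattern, assuming the backpressure state is an `Arc<Mutex<Option<Instant>>>`: the native type carries the shared deadline and its methods, while the wasm build gets an empty zero-sized type of the same name so a field of that type still compiles. `ServerBackpressure` is the name used in the commit message; its fields and the `record` and `blocked_until` methods here are hypothetical.

```rust
// Native targets: the real state, shared across clones through an Arc.
#[cfg(not(target_family = "wasm"))]
mod backpressure {
    use std::sync::{Arc, Mutex};
    use std::time::{Duration, Instant};

    #[derive(Clone, Default)]
    pub struct ServerBackpressure {
        deadline: Arc<Mutex<Option<Instant>>>,
    }

    impl ServerBackpressure {
        /// Only an exhausted quota (remaining == 0) sets a deadline; it
        /// expires on its own once `now` passes it.
        pub fn record(&self, remaining: u32, reset: Duration, now: Instant) {
            if remaining == 0 {
                *self.deadline.lock().unwrap() = Some(now + reset);
            }
        }

        /// Deadline poll_ready should wait for, or None if requests may proceed.
        pub fn blocked_until(&self, now: Instant) -> Option<Instant> {
            (*self.deadline.lock().unwrap()).filter(|&t| t > now)
        }
    }
}

// wasm: an empty zero-sized type, so the field type still compiles while the
// methods and helpers above are not compiled at all (no dead-code warnings).
#[cfg(target_family = "wasm")]
mod backpressure {
    #[derive(Clone, Copy, Default)]
    pub struct ServerBackpressure;
}

pub use backpressure::ServerBackpressure;
```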

64bit commented Jul 1, 2026

Owner

Translation:

Following up on the discussion in #320, I have implemented the RPM rate-limiting layer.

The layer sits beneath the retry layer. governor handles RPM rate limiting, and the layer also reads the server's rate-limit status from the x-ratelimit-remaining-requests and x-ratelimit-reset-requests headers. When the remaining request count hits zero, poll_ready stays pending until the reset time has passed, which avoids sending requests the server would reject. The governor bucket and backpressure state are shared across clones, so retries draw from the same bucket.

On native platforms, tokio::time::sleep is used for waiting, and the sleep future is stored in the service so the registered waker is not lost between polls; on WASM, only governor's local counting is used, without any delay.

TPM rate limiting requires extracting token usage from the response body and will be implemented separately.

The layer only takes effect when the rate-limit feature flag is enabled, so existing users are unaffected. Thirteen unit tests cover header parsing, backpressure state, waker registration, governor bucket sharing, and retries passing through the rate-limiting layer.

64bit commented Jul 1, 2026

Owner

Thanks for this.

Would appreciate a few changes:

  1. Let's keep this under the existing middleware feature flag.
  2. I don't think WASM support makes sense here because there is no sleep; also, from the implementation it looks like the service bypasses this layer there and forwards straight to the inner service, so I'm in favour of keeping this rate limit layer for non-wasm targets only.
  3. RateLimitLayer should support the other time units the underlying Quota supports (per second, per minute, etc.), perhaps by taking NonZeroU32 directly instead of u32.
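One way to follow point 3, sketched with std only: governor's `Quota` already has `per_second`, `per_minute` and `per_hour` constructors that take a `NonZeroU32`, so a layer constructor could accept the same unit choice and count. `RequestRate` and `replenish_interval` are hypothetical names; passing a governor `Quota` straight through would be an alternative.

```rust
use std::num::NonZeroU32;
use std::time::Duration;

/// Hypothetical rate description covering the units governor's Quota
/// constructors accept (per_second / per_minute / per_hour).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RequestRate {
    PerSecond(NonZeroU32),
    PerMinute(NonZeroU32),
    PerHour(NonZeroU32),
}

impl RequestRate {
    /// Length of the window the count applies to.
    pub fn period(self) -> Duration {
        match self {
            RequestRate::PerSecond(_) => Duration::from_secs(1),
            RequestRate::PerMinute(_) => Duration::from_secs(60),
            RequestRate::PerHour(_) => Duration::from_secs(3600),
        }
    }

    /// Time between replenished cells. NonZeroU32 makes a zero divisor
    /// unrepresentable, so no runtime check or panic path is needed here.
    pub fn replenish_interval(self) -> Duration {
        let n = match self {
            RequestRate::PerSecond(n) | RequestRate::PerMinute(n) | RequestRate::PerHour(n) => n,
        };
        self.period() / n.get()
    }
}
```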
