# Rate limits, concurrency, retries, and error recovery
This document explains how the extracted @github/copilot CLI bundle handles model rate limits, concurrent model work, retries, cancellation, and related error paths. It complements model-api-routing.md, which explains which network API shape is selected for each provider/model.
## Executive summary
- Model API calls are wrapped by a CLI-owned retry loop. The observed model clients disable provider SDK retries with `maxRetries: 0` at the call site or client construction point, so retry decisions are centralized in the CLI runtime.
- Rate limits are handled in two layers: first by request retry/backoff, then by session-level recovery. Session recovery can switch to auto mode, prompt the UI to switch, or pause queued messages until the reset time.
- Model concurrency is not a single global HTTP throttle. A single session processes its message queue serially, while background subagents, fleet mode, background sessions, and MCP tasks can create parallel model calls.
- Parallel subagent execution is capped by a dedicated subagent concurrency limiter with account-plan defaults and `COPILOT_SUBAGENT_MAX_CONCURRENT` override support.
- WebSocket Responses has special safety rails: the WebSocket adapter cannot be reused concurrently, falls back to HTTP Responses for early 429/500/503 WebSocket failures, and resets WebSocket state when conversation history is rewritten.
- Non-rate-limit recovery includes HTTP/2 transport reset, undici dispatcher reset, session-token refresh, request-size reduction, transient bad-request retry, streaming-error retry, quota classification, context-limit errors, tool-result normalization, and cancellation propagation.
## Source anchors
app.js is bundled/minified, so semantic aliases below are analysis names. Minified anchors are version-specific lookup aids for the analyzed @github/copilot artifact and will shift across releases.
| Area | Semantic alias | Minified anchor | Approx. line | Role |
|---|---|---|---|---|
| Model request loop | ModelRequestLoop | U3.getCompletionWithTools(...) | app.js 3439 | Owns per-turn retry, tool loops, model success/failure events, and final failure construction. |
| Retry-delay resolver | resolveModelRetryDelay(...) | p3n(...) | app.js 3439 | Reads x-should-retry, retry-after-ms, retry-after, status codes, configured retry codes, exponential backoff, and jitter. |
| Default model retry options | initDefaultModelOptions(...) | U3.initDefaultOptions(...) | app.js 3439 | Sets default maxRetries: 5, default rate-limit wait, exponential growth, and maximum retry-after window. |
| Streaming transport retry processor | StreamingErrorRetryProcessor | _Ce | app.js 3103 | Retries non-API streaming errors after a short delay, excluding abort/timeout errors. |
| Content-annotation retry processor | SnippyProcessor | aft | app.js 3092 | Blocks tool execution and retries when Copilot annotations request a soft blocking pass. |
| SDK retry suppression | DisableSdkRetriesAtModelBoundary | maxRetries: 0 | app.js 3437, 3439, 3457, 3460 | Disables provider SDK retries for model clients/calls so the CLI retry loop controls policy. |
| Rate-limit classifier | classifyRateLimitError(...) | Xjn(...), qgr(...), zgr(...), eWn(...), SAr(...) | app.js 4161, 4252 | Converts 429 errors into user-facing rate-limit codes, messages, retry-after seconds, and auto-switch eligibility. |
| Session rate-limit recovery | SessionRateLimitRecovery | processQueuedItems(...) branch | app.js 4481, 4487 | Emits rate-limit errors, switches to auto mode when allowed, prompts the UI, or pauses pending messages. |
| Rate-limit pause timer | waitForRateLimitPause(...) | C8s(...), S8s | app.js 4471 | Waits for retry-after or a default 120-second pause, cancellable by the session abort signal. |
| Generic GitHub fetch retry | GitHubFetchWithRetry | Sy(...) | app.js 238 | Retries non-model GitHub API requests with Retry-After, exponential backoff, and circuit breaker integration. |
| Circuit breaker | HostCircuitBreaker | b1e, QQ | app.js 236-238 | Opens after repeated network/5xx failures, blocks requests briefly, then half-opens for a probe. |
| Session queue | SessionMessageQueue | ua.itemQueue, processQueue(...) | app.js 4475-4481 | Serializes normal messages in one session and defers idle while background agents are active. |
| Serial operation queue | SerialOperationQueue | nj | app.js 239 | Generic one-at-a-time async operation queue used by runtime services. |
| Task registry | TaskRegistry | B3 | app.js 3367 | Tracks background/sync agents, states, waiters, cancellation, completion, and active-time accounting. |
| Subagent concurrency limiter | SubagentConcurrencyLimiter | qmt | app.js 3698 | Caps concurrent subagent execution and releases slots on completion or idle wait. |
| WebSocket Responses fallback | WebSocketResponsesResilience | Pmt | app.js 3469-3472 | Prevents concurrent reuse, falls back to HTTP Responses on early WebSocket failures, and closes/reset state. |
| Quota message map | QuotaErrorMessages | Rii | app.js 191 | Maps 402 quota/billing errors to user-facing messages. |
| Usage-limit warning UI | UsageLimitWarningEmitter | wHo(...), THo(...) | app.js 6860 | Emits threshold warnings when weekly/session usage snapshots cross 50/75/90/95 percent. |
## Model API retry flow
```mermaid
sequenceDiagram
    autonumber
    participant Session as Session runtime
    participant Loop as ModelRequestLoop
    participant Adapter as Provider adapter
    participant API as Model API
    participant Processors as Request processors

    Session->>Loop: system prompt, messages, tools, abort signal
    Loop->>Processors: preRequest hooks
    Loop->>Adapter: send request with SDK retries disabled
    Adapter->>API: Chat / Responses / Messages / WebSocket
    API-->>Adapter: response or error
    Adapter-->>Loop: response, stream, or APIError
    alt success
        Loop-->>Session: model_call_success + message/tool events
    else retryable error
        Loop-->>Session: model_call_failure
        Loop->>Processors: onRequestError hooks
        Loop->>Loop: compute retry delay + jitter
        Loop-->>Session: turn_retry
        Loop->>Adapter: retry request
    else exhausted or non-retryable
        Loop-->>Session: turn_failed
        Loop-->>Session: throw classified error
    end
```

The observed default model retry policy is:
| Setting | Observed default | Meaning |
|---|---|---|
| `maxRetries` | 5 | Maximum retry loop attempts before the model turn fails. |
| `defaultRetryAfterSeconds` | 5 | Base wait when a 429 lacks an explicit reset delay. |
| `initialRetryAfterBackoffExtraSeconds` | 1 | Extra exponential component for 429 backoff. |
| `retryAfterBackoffExtraGrowth` | 2 | Exponential growth factor. |
| `maxRetryAfterSeconds` | 180 | Upper bound; delays above this are treated as too long and the request gives up. |
`COPILOT_AGENT_ERROR_CODES_TO_RETRY` can add status codes or ranges to the model retry policy. The retry resolver also honors provider headers:
| Header/status | Handling |
|---|---|
| `x-should-retry: false` | Do not retry, even if the status would normally be retryable. |
| `x-should-retry: true` | Retry if a valid delay can be computed. |
| `retry-after-ms` | Millisecond retry delay; converted to seconds in the model loop. |
| `retry-after` | Parsed as seconds or HTTP date. |
| 408, 409, 429, 499 | Default retryable client/server-boundary statuses for model calls. |
| >= 500 | Default retryable server-side statuses for model calls. |
| configured status/range | Extra retryable statuses from runtime settings/env. |
When no explicit reset delay is available, 429 uses a rate-limit-specific exponential delay; other retryable errors use a shorter exponential delay. Both paths add jitter so concurrent clients do not retry at exactly the same instant.
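The resolution rules above can be sketched roughly as follows. The function name, the `RetryPolicy` shape, and the exact backoff/jitter formula are illustrative assumptions for this analysis, not identifiers recovered from the bundle:

```typescript
// Hypothetical sketch of the retry-delay resolution described above.
interface RetryPolicy {
  maxRetries: number;               // observed default: 5
  defaultRetryAfterSeconds: number; // observed default: 5
  maxRetryAfterSeconds: number;     // observed default: 180
}

// Returns a delay in seconds, or null when the request should not be retried.
function resolveRetryDelay(
  status: number,
  headers: Map<string, string>,
  attempt: number,
  policy: RetryPolicy,
): number | null {
  // An explicit provider override wins in the "do not retry" direction.
  const shouldRetry = headers.get("x-should-retry");
  if (shouldRetry === "false") return null;

  // Prefer an explicit delay from the provider headers.
  const ms = headers.get("retry-after-ms");
  let delay = ms !== undefined ? Number(ms) / 1000 : NaN;
  if (Number.isNaN(delay)) {
    const ra = headers.get("retry-after");
    if (ra !== undefined) {
      delay = Number(ra); // seconds form; the HTTP-date form would need Date parsing
    }
  }

  const retryable =
    shouldRetry === "true" ||
    [408, 409, 429, 499].includes(status) ||
    status >= 500;
  if (!retryable) return null;

  if (Number.isNaN(delay)) {
    // No explicit delay: exponential backoff, with a steeper base for 429s.
    const base = status === 429 ? policy.defaultRetryAfterSeconds : 1;
    delay = base * Math.pow(2, attempt);
  }

  // Jitter so concurrent clients do not retry in lockstep.
  delay += Math.random();

  // Delays beyond the cap are treated as "too long": give up on this loop.
  return delay > policy.maxRetryAfterSeconds ? null : delay;
}
```

The key property to notice is that `x-should-retry: false` short-circuits everything, while a too-large `retry-after` turns a retryable 429 into an immediate failure that the session-level recovery then handles.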
## Rate-limit lifecycle
```mermaid
flowchart TD
    Error429["Model call receives HTTP 429"] --> RetryResolver["resolve retry delay"]
    RetryResolver -->|delay <= max| Retry["turn_retry after backoff"]
    Retry --> Error429
    RetryResolver -->|exhausted or delay too long| Throw["throw model failure"]
    Throw --> Classifier["session error classifier"]
    Classifier --> Event["session.error: rate_limit"]
    Event --> AutoEligible{"eligible for auto switch?"}
    AutoEligible -->|yes + setting| AutoSwitch["switch to auto model"]
    AutoEligible -->|yes + UI listener| Prompt["auto_mode_switch.requested"]
    Prompt -->|yes| AutoSwitch
    Prompt -->|no| Pause
    AutoEligible -->|no| Pause["pause pending queue"]
    Pause --> Wait["Retry-After or 120 seconds"]
    Wait --> Continue["continue processing queued messages"]
```

The rate-limit classifier recognizes several fine-grained codes when upstream includes them:
| Error code | User-facing meaning | Auto-mode switch suffix |
|---|---|---|
| `user_weekly_rate_limited` | Weekly Copilot rate limit reached. | Can suggest waiting or switching to auto model. |
| `user_model_rate_limited` | Selected model rate limit reached. | Suggests switching models or waiting. |
| `user_global_rate_limited` | Session/global limit reached. | Not eligible for auto switch in the observed eligibility set. |
| `integration_rate_limited` | Integration/model path rate limit reached. | Suggests switching models or waiting. |
| `rate_limited` | Generic rate limit. | Can suggest waiting or switching to auto model. |
Auto-mode switching is only considered when all of these are true:
- the current selected model is not already auto mode;
- the session is not using a custom/BYOK provider;
- the error code is not excluded from auto-switch eligibility;
- the user setting says to continue on auto mode, or the UI accepts the `auto_mode_switch.requested` prompt.
If no switch happens and there are pending messages, the queue pauses for the upstream retry-after seconds. If upstream does not provide a reset delay, the default pause is 120 seconds. The pause is abortable through the session abort signal.
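The abortable pause can be sketched as below; the function name reuses the semantic alias `waitForRateLimitPause` from the source-anchor table, but the signature and return values are assumptions:

```typescript
// Illustrative sketch of the abortable rate-limit pause described above.
// Waits for the upstream retry-after, or the 120-second default, and
// resolves early if the session abort signal fires.
function waitForRateLimitPause(
  retryAfterSeconds: number | undefined,
  signal: AbortSignal,
): Promise<"elapsed" | "aborted"> {
  const delayMs = (retryAfterSeconds ?? 120) * 1000; // 120 s default pause
  return new Promise((resolve) => {
    if (signal.aborted) return resolve("aborted");
    const timer = setTimeout(() => resolve("elapsed"), delayMs);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        resolve("aborted");
      },
      { once: true },
    );
  });
}
```

Resolving (rather than rejecting) on abort keeps the queue-processing caller free to treat cancellation as an ordinary control-flow outcome instead of an error.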
## Model concurrency model
There are multiple concurrency layers; they intentionally solve different problems.
```mermaid
flowchart TD
    UserMessages["User/session messages"] --> SessionQueue["SessionMessageQueue"]
    SessionQueue -->|serial in one session| MainTurn["main agent model turn"]

    MainTurn --> Tools["tool calls"]
    Tools --> TaskTool["task tool / subagents"]
    TaskTool --> Limiter["SubagentConcurrencyLimiter"]
    Limiter --> Agents["parallel subagent sessions"]
    Agents --> ModelCalls["parallel model API calls"]

    MainTurn --> WebSocket["WebSocket Responses adapter"]
    WebSocket -->|one active loop| WsGuard["no concurrent reuse"]
    WebSocket --> Nested["nested prompt client for nested calls"]

    BackgroundSessions["background sessions"] --> ModelCalls
    MCPTasks["MCP task streams"] --> Agents
```

### Within one session
Normal user messages are queued and processed serially. If a turn is already active, additional user messages are appended to `itemQueue`; immediate system notifications are stored separately and injected at the next request boundary. The session also defers `session.idle` while active background agents are running.
This means a single foreground session does not normally issue overlapping main-agent model turns. It can still create concurrent model traffic through background subagents and external task mechanisms.
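The serial behavior can be modeled as a promise-chain queue. `SessionQueue` and `enqueue` are illustrative names; the real queue also handles immediate notifications, idle deferral, and background-agent tracking:

```typescript
// Minimal sketch of serial message processing, assuming a promise-chain design.
class SessionQueue<T> {
  private tail: Promise<void> = Promise.resolve();

  // Each item runs only after every previously enqueued item settles, so a
  // session never has two overlapping main-agent turns.
  enqueue(item: T, handler: (item: T) => Promise<void>): Promise<void> {
    const run = this.tail.then(() => handler(item));
    this.tail = run.catch(() => undefined); // a failed turn must not wedge the queue
    return run;
  }
}
```

Chaining on the previous tail (rather than using a worker loop) keeps ordering guarantees without any explicit locking.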
### Subagents and fleet mode
Subagent concurrency is controlled by a limiter:
| Account plan | Observed default concurrent subagent slots |
|---|---|
| Free / Edu | 2 |
| Pro / Pro Plus | 4 |
| Max | 8 |
| Business | 16 |
| Enterprise / default fallback | 32 |
`COPILOT_SUBAGENT_MAX_CONCURRENT` can override the limit, clamped to the range 1..256. A failed acquire returns a user-facing error such as “Maximum concurrent agent limit … reached.” Slots are released when an agent completes, fails, is cancelled, or enters an idle multi-turn wait. When an idle multi-turn agent receives a new message, it reacquires a slot before resuming.
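A slot-based limiter with the described clamping could look like the sketch below; the class shape and method names are assumptions, while the 1..256 clamp and plan-default fallback follow the observed behavior:

```typescript
// Hedged sketch of a subagent concurrency limiter with env-override clamping.
class SubagentLimiter {
  private inUse = 0;
  private readonly limit: number;

  constructor(planDefault: number, envOverride?: string) {
    const parsed = envOverride !== undefined ? Number(envOverride) : NaN;
    // COPILOT_SUBAGENT_MAX_CONCURRENT is clamped to 1..256; anything
    // unparsable falls back to the account-plan default.
    this.limit = Number.isFinite(parsed)
      ? Math.min(256, Math.max(1, Math.floor(parsed)))
      : planDefault;
  }

  tryAcquire(): boolean {
    if (this.inUse >= this.limit) return false; // "maximum concurrent agent limit reached"
    this.inUse += 1;
    return true;
  }

  // Released on completion, failure, cancellation, or idle multi-turn wait.
  release(): void {
    this.inUse = Math.max(0, this.inUse - 1);
  }
}
```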
Fleet mode does not introduce a separate model scheduler. It uses the same task tool and TaskRegistry, so fleet parallelism is bounded by the same subagent limiter and by upstream model/account rate limits.
### WebSocket Responses concurrency
The WebSocket Responses adapter has a stricter local guard: one adapter instance cannot be reused for overlapping tool loops. If a nested prompt/model call is needed, it uses a nested HTTP Responses client rather than reusing the same WebSocket client. This avoids corrupting previous_response_id, incremental input state, and connection ownership.
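A minimal sketch of such a reuse guard, with hypothetical names (the real adapter tracks much more state, such as `previous_response_id` and incremental input):

```typescript
// Illustrative guard preventing overlapping tool loops on one adapter instance.
class WebSocketAdapterGuard {
  private active = false;

  // Runs one tool loop; a second concurrent call fails fast instead of
  // corrupting the adapter's connection and response-chain state.
  run<T>(loop: () => Promise<T>): Promise<T> {
    if (this.active) {
      throw new Error("WebSocket Responses adapter is already running a tool loop");
    }
    this.active = true;
    return loop().finally(() => {
      this.active = false;
    });
  }
}
```

Failing fast here is what pushes nested prompt/model calls onto a separate HTTP Responses client rather than the in-flight WebSocket client.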
## Generic GitHub API retry and circuit breaker
Some non-model GitHub API calls use a separate fetch wrapper with retry and circuit breaker behavior.
```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Closed: success or non-counted error
    Closed --> Open: repeated network/5xx failures
    Open --> Open: reject requests immediately
    Open --> HalfOpen: reset timeout elapsed
    HalfOpen --> Closed: probe succeeds
    HalfOpen --> Open: probe fails
```

Observed behavior:
- retryable statuses include 429, 500, and 502 for the generic GitHub fetch wrapper;
- circuit-breaker-counted response statuses include 500, 502, 503, and 504;
- network/connect/timeout-style errors are counted by the circuit breaker;
- the default retry count is 5, the default delay is 5 seconds, and the backoff factor is 2;
- the circuit opens after repeated failures and rejects requests until its reset window elapses;
- if an HTTP response does not look like it originated from GitHub and lacks expected tracing headers, it is classified as a possible proxy/firewall interception.
This wrapper is separate from the model request loop. Do not assume every model API call goes through the generic GitHub fetch wrapper; the model adapters call SDK clients directly and then apply the model retry loop above.
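The state machine above can be sketched as a small breaker class. The failure threshold and reset timeout here are placeholder values, and the clock is injected for testability; none of these names come from the bundle:

```typescript
// Simplified closed/open/half-open breaker matching the state diagram above.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,      // placeholder, not the real constant
    private readonly resetTimeoutMs = 30_000,   // placeholder, not the real constant
    private readonly now: () => number = Date.now,
  ) {}

  allowRequest(): boolean {
    if (this.state === "open") {
      if (this.now() - this.openedAt >= this.resetTimeoutMs) {
        this.state = "half-open"; // allow a single probe request
        return true;
      }
      return false; // reject immediately while open
    }
    return true;
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failures = 0;
  }

  // Counted failures: network/connect/timeout errors and 500/502/503/504.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```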
## Other recovery paths worth knowing
| Scenario | Runtime behavior |
|---|---|
| 401 session token expiry | The model loop can call a session-token refresh callback once, update the client token, and retry immediately. If refresh fails for auto mode, the session clears auto-mode state and asks the user to resend. |
| 402 quota/billing | No retry. The session emits `session.error` with `errorType: "quota"` and maps known codes such as `quota_exceeded`, `session_quota_exceeded`, and `billing_not_configured` to user-facing guidance. |
| 400 transient bad request | The model loop retries a small number of times with short backoff before treating it as a query error. |
| 413 request too large | The model loop removes images/native document attachments from older messages, emits `binary_attachments_removed`, and retries. |
| Context-window overflow | The session emits session.error with errorType: "context_limit" and suggests starting a new session or compacting. |
| HTTP/2 GOAWAY / undici pool assertion | The loop resets the global dispatcher, wraps the problem as a retryable transport error, and retries. |
| WebSocket Responses early 429/500/503 | If no meaningful streaming output has arrived, the WebSocket adapter disables that attempt, closes the connection, and falls back to HTTP Responses for the remaining retry path. |
| Missing Responses completion event | If accumulated output is substantive, the Responses adapter uses the snapshot as a fallback; otherwise it throws so the outer retry path can retry. |
| Non-API streaming error | A request processor retries after a short delay, except for abort and timeout errors. |
| User/session abort | Abort signals are passed into model requests, WebSocket waits, tool execution, and rate-limit pause timers. Aborts are not retried as ordinary failures. |
| Parent subagent cancellation | TaskRegistry recursively cancels child agents, aborts their controllers, resolves waiters, and marks tasks as cancelled. |
| Tool execution failure | Tool failures are converted into tool-result messages for the model when possible, with telemetry/error fields preserved. |
| Invalid tool-call JSON | Tool arguments can be cleared and returned to the model as a structured failure instead of crashing the runtime. |
| Copilot content annotations | A processor can block tool execution and retry; after max retries it soft-fails and continues. |
| Git command errors | Git errors are classified by type; known expected cases can set skipReport to avoid noisy crash reporting. |
## Usage and quota warnings
The model success path reads quota/usage snapshot headers with prefixes such as `x-quota-snapshot-` and `x-usage-ratelimit-`. The TUI warning layer watches assistant usage events and emits ephemeral usage-limit warnings at these remaining/used thresholds:
- over 50 percent used;
- over 75 percent used;
- over 90 percent used;
- over 95 percent used.
The warning layer tracks which thresholds were already emitted so users are not repeatedly spammed for the same limit crossing. Debug env vars can force remaining percentages for testing, but the real source is the provider response headers.
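The threshold bookkeeping can be sketched like this; the 50/75/90/95 values come from the text above, while the class and method names are illustrative:

```typescript
// Sketch of usage-warning deduplication: each threshold fires at most once.
const THRESHOLDS = [50, 75, 90, 95];

class UsageWarningTracker {
  private fired = new Set<number>();

  // Returns the thresholds newly crossed by this usage snapshot (percent used),
  // so a user is not re-warned for the same limit crossing.
  warningsFor(percentUsed: number): number[] {
    const newly = THRESHOLDS.filter(
      (t) => percentUsed > t && !this.fired.has(t),
    );
    newly.forEach((t) => this.fired.add(t));
    return newly;
  }
}
```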
## Practical takeaways
- A burst of parallel subagents can multiply model traffic; the local subagent limiter only caps local parallelism, not upstream account limits.
- For a single foreground chat, model requests are mostly serial because the session queue processes one active turn at a time.
- 429s are not immediately fatal. The CLI first retries with backoff, then may switch to auto mode or pause pending messages.
- A custom/BYOK provider can still return 429, but the auto-mode fallback is GitHub Copilot-specific and is not used for custom providers.
- `Retry-After` values longer than the model retry policy maximum are treated as too long for the immediate retry loop; the session-level pause may still use retry-after for queued-message recovery.
- WebSocket Responses is opportunistic. It can improve streaming behavior, but early WebSocket transport failures fall back to HTTP Responses.
- 402 quota errors and context-window errors are classified, user-facing terminal states, not retry loops.
Created and maintained by Yingting Huang.