Rate limits, concurrency, retries, and error recovery

This document explains how the extracted @github/copilot CLI bundle handles model rate limits, concurrent model work, retries, cancellation, and related error paths. It complements model-api-routing.md, which explains which network API shape is selected for each provider/model.

Executive summary

  • Model API calls are wrapped by a CLI-owned retry loop. The observed model clients disable provider SDK retries with maxRetries: 0 at the call site or client construction point, so retry decisions are centralized in the CLI runtime.
  • Rate limits are handled in two layers: first by request retry/backoff, then by session-level recovery. Session recovery can switch to auto mode, prompt the UI to switch, or pause queued messages until the reset time.
  • Model concurrency is not a single global HTTP throttle. A single session processes its message queue serially, while background subagents, fleet mode, background sessions, and MCP tasks can create parallel model calls.
  • Parallel subagent execution is capped by a dedicated subagent concurrency limiter with account-plan defaults and COPILOT_SUBAGENT_MAX_CONCURRENT override support.
  • WebSocket Responses has special safety rails: the WebSocket adapter cannot be reused concurrently, falls back to HTTP Responses for early 429/500/503 WebSocket failures, and resets WebSocket state when conversation history is rewritten.
  • Non-rate-limit recovery includes HTTP/2 transport reset, undici dispatcher reset, session-token refresh, request-size reduction, transient bad-request retry, streaming-error retry, quota classification, context-limit errors, tool-result normalization, and cancellation propagation.

Source anchors

app.js is bundled/minified, so semantic aliases below are analysis names. Minified anchors are version-specific lookup aids for the analyzed @github/copilot artifact and will shift across releases.

| Area | Semantic alias | Minified anchor | Approx. line | Role |
| --- | --- | --- | --- | --- |
| Model request loop | ModelRequestLoop | U3.getCompletionWithTools(...) | app.js 3439 | Owns per-turn retry, tool loops, model success/failure events, and final failure construction. |
| Retry-delay resolver | resolveModelRetryDelay(...) | p3n(...) | app.js 3439 | Reads x-should-retry, retry-after-ms, retry-after, status codes, configured retry codes, exponential backoff, and jitter. |
| Default model retry options | initDefaultModelOptions(...) | U3.initDefaultOptions(...) | app.js 3439 | Sets default maxRetries: 5, default rate-limit wait, exponential growth, and maximum retry-after window. |
| Streaming transport retry processor | StreamingErrorRetryProcessor | _Ce | app.js 3103 | Retries non-API streaming errors after a short delay, excluding abort/timeout errors. |
| Content-annotation retry processor | SnippyProcessor | aft | app.js 3092 | Blocks tool execution and retries when Copilot annotations request a soft blocking pass. |
| SDK retry suppression | DisableSdkRetriesAtModelBoundary | maxRetries: 0 | app.js 3437, 3439, 3457, 3460 | Disables provider SDK retries for model clients/calls so the CLI retry loop controls policy. |
| Rate-limit classifier | classifyRateLimitError(...) | Xjn(...), qgr(...), zgr(...), eWn(...), SAr(...) | app.js 4161, 4252 | Converts 429 errors into user-facing rate-limit codes, messages, retry-after seconds, and auto-switch eligibility. |
| Session rate-limit recovery | SessionRateLimitRecovery | processQueuedItems(...) branch | app.js 4481, 4487 | Emits rate-limit errors, switches to auto mode when allowed, prompts the UI, or pauses pending messages. |
| Rate-limit pause timer | waitForRateLimitPause(...) | C8s(...), S8s | app.js 4471 | Waits for retry-after or a default 120-second pause, cancellable by the session abort signal. |
| Generic GitHub fetch retry | GitHubFetchWithRetry | Sy(...) | app.js 238 | Retries non-model GitHub API requests with Retry-After, exponential backoff, and circuit breaker integration. |
| Circuit breaker | HostCircuitBreaker | b1e, QQ | app.js 236-238 | Opens after repeated network/5xx failures, blocks requests briefly, then half-opens for a probe. |
| Session queue | SessionMessageQueue | ua.itemQueue, processQueue(...) | app.js 4475-4481 | Serializes normal messages in one session and defers idle while background agents are active. |
| Serial operation queue | SerialOperationQueue | nj | app.js 239 | Generic one-at-a-time async operation queue used by runtime services. |
| Task registry | TaskRegistry | B3 | app.js 3367 | Tracks background/sync agents, states, waiters, cancellation, completion, and active-time accounting. |
| Subagent concurrency limiter | SubagentConcurrencyLimiter | qmt | app.js 3698 | Caps concurrent subagent execution and releases slots on completion or idle wait. |
| WebSocket Responses fallback | WebSocketResponsesResilience | Pmt | app.js 3469-3472 | Prevents concurrent reuse, falls back to HTTP Responses on early WebSocket failures, and closes/resets state. |
| Quota message map | QuotaErrorMessages | Rii | app.js 191 | Maps 402 quota/billing errors to user-facing messages. |
| Usage-limit warning UI | UsageLimitWarningEmitter | wHo(...), THo(...) | app.js 6860 | Emits threshold warnings when weekly/session usage snapshots cross 50/75/90/95 percent. |

Model API retry flow

sequenceDiagram
autonumber
participant Session as Session runtime
participant Loop as ModelRequestLoop
participant Adapter as Provider adapter
participant API as Model API
participant Processors as Request processors
Session->>Loop: system prompt, messages, tools, abort signal
Loop->>Processors: preRequest hooks
Loop->>Adapter: send request with SDK retries disabled
Adapter->>API: Chat / Responses / Messages / WebSocket
API-->>Adapter: response or error
Adapter-->>Loop: response, stream, or APIError
alt success
Loop-->>Session: model_call_success + message/tool events
else retryable error
Loop-->>Session: model_call_failure
Loop->>Processors: onRequestError hooks
Loop->>Loop: compute retry delay + jitter
Loop-->>Session: turn_retry
Loop->>Adapter: retry request
else exhausted or non-retryable
Loop-->>Session: turn_failed
Loop-->>Session: throw classified error
end

The observed default model retry policy is:

| Setting | Observed default | Meaning |
| --- | --- | --- |
| maxRetries | 5 | Maximum retry loop attempts before the model turn fails. |
| defaultRetryAfterSeconds | 5 | Base wait when a 429 lacks an explicit reset delay. |
| initialRetryAfterBackoffExtraSeconds | 1 | Extra exponential component for 429 backoff. |
| retryAfterBackoffExtraGrowth | 2 | Exponential growth factor. |
| maxRetryAfterSeconds | 180 | Upper bound; delays above this are treated as too long and the request gives up. |

COPILOT_AGENT_ERROR_CODES_TO_RETRY can add status codes or ranges to the model retry policy. The retry resolver also honors provider headers:

| Header/status | Handling |
| --- | --- |
| x-should-retry: false | Do not retry, even if the status would normally be retryable. |
| x-should-retry: true | Retry if a valid delay can be computed. |
| retry-after-ms | Millisecond retry delay; converted to seconds in the model loop. |
| retry-after | Parsed as seconds or HTTP date. |
| 408, 409, 429, 499 | Default retryable client/server-boundary statuses for model calls. |
| >= 500 | Default retryable server-side statuses for model calls. |
| configured status/range | Extra retryable statuses from runtime settings/env. |

When no explicit reset delay is available, 429 uses a rate-limit-specific exponential delay; other retryable errors use a shorter exponential delay. Both paths add jitter so concurrent clients do not retry at exactly the same instant.
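
Putting the header rules and backoff defaults together, the resolution order can be sketched as below. Everything here is illustrative: the function names, the jitter factor, and the non-429 backoff curve are assumptions layered on the observed defaults, not the bundle's actual code.

```javascript
// Hypothetical sketch of the retry-delay resolution order described above.
// Names, jitter, and the non-429 curve are assumptions; the defaults mirror
// the observed policy table.
const DEFAULTS = {
  maxRetries: 5,
  defaultRetryAfterSeconds: 5,
  initialRetryAfterBackoffExtraSeconds: 1,
  retryAfterBackoffExtraGrowth: 2,
  maxRetryAfterSeconds: 180,
};

function resolveRetryDelay(status, headers, attempt, opts = DEFAULTS) {
  // 1. An explicit provider override wins in both directions.
  if (headers["x-should-retry"] === "false") return null; // do not retry
  // 2. Explicit delay headers.
  const ms = Number(headers["retry-after-ms"]);
  if (Number.isFinite(ms) && ms > 0) return cap(ms / 1000, opts);
  const ra = headers["retry-after"];
  if (ra !== undefined) {
    const secs = Number(ra);
    if (Number.isFinite(secs)) return cap(secs, opts);
    const when = Date.parse(ra); // HTTP-date form
    if (!Number.isNaN(when)) return cap((when - Date.now()) / 1000, opts);
  }
  // 3. Status-based defaults plus configured extras.
  const retryable =
    headers["x-should-retry"] === "true" ||
    [408, 409, 429, 499].includes(status) ||
    status >= 500;
  if (!retryable) return null;
  // 4. Computed backoff: 429 gets base + exponential extra; other retryable
  //    errors use a shorter (assumed) exponential curve.
  const extra =
    opts.initialRetryAfterBackoffExtraSeconds *
    opts.retryAfterBackoffExtraGrowth ** attempt;
  const base =
    status === 429
      ? opts.defaultRetryAfterSeconds + extra
      : Math.min(2 ** attempt + 1, 8);
  return base * (1 + Math.random() * 0.25); // jitter spreads concurrent retries
}

function cap(seconds, opts) {
  // Delays past the policy maximum are treated as "too long": give up instead.
  return seconds > opts.maxRetryAfterSeconds ? null : seconds;
}
```

Note that explicit header delays bypass jitter in this sketch; jitter only matters when several clients would otherwise compute the same synthetic backoff.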

Rate-limit lifecycle

flowchart TD
Error429["Model call receives HTTP 429"] --> RetryResolver["resolve retry delay"]
RetryResolver -->|delay <= max| Retry["turn_retry after backoff"]
Retry --> Error429
RetryResolver -->|exhausted or delay too long| Throw["throw model failure"]
Throw --> Classifier["session error classifier"]
Classifier --> Event["session.error: rate_limit"]
Event --> AutoEligible{"eligible for auto switch?"}
AutoEligible -->|yes + setting| AutoSwitch["switch to auto model"]
AutoEligible -->|yes + UI listener| Prompt["auto_mode_switch.requested"]
Prompt -->|yes| AutoSwitch
Prompt -->|no| Pause
AutoEligible -->|no| Pause["pause pending queue"]
Pause --> Wait["Retry-After or 120 seconds"]
Wait --> Continue["continue processing queued messages"]

The rate-limit classifier recognizes several fine-grained codes when upstream includes them:

| Error code | User-facing meaning | Auto-mode switch eligibility |
| --- | --- | --- |
| user_weekly_rate_limited | Weekly Copilot rate limit reached. | Can suggest waiting or switching to auto model. |
| user_model_rate_limited | Selected model rate limit reached. | Suggests switching models or waiting. |
| user_global_rate_limited | Session/global limit reached. | Not eligible for auto switch in the observed eligibility set. |
| integration_rate_limited | Integration/model path rate limit reached. | Suggests switching models or waiting. |
| rate_limited | Generic rate limit. | Can suggest waiting or switching to auto model. |

Auto-mode switching is only considered when all of these are true:

  1. the current selected model is not already auto mode;
  2. the session is not using a custom/BYOK provider;
  3. the error code is not excluded from auto-switch eligibility;
  4. the user setting says to continue on auto mode, or the UI accepts the auto_mode_switch.requested prompt.

If no switch happens and there are pending messages, the queue pauses for the upstream retry-after seconds. If upstream does not provide a reset delay, the default pause is 120 seconds. The pause is abortable through the session abort signal.
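
The pause can be sketched as a timer raced against the session abort signal. The function name and error shape here are hypothetical; only the 120-second default and the abortability come from the observed behavior.

```javascript
// Illustrative sketch (not the bundle's code): pause queued processing until
// the upstream reset time, or a 120-second default, abortable by the session.
const DEFAULT_RATE_LIMIT_PAUSE_SECONDS = 120;

function waitForRateLimitPause(retryAfterSeconds, signal) {
  const seconds = retryAfterSeconds ?? DEFAULT_RATE_LIMIT_PAUSE_SECONDS;
  return new Promise((resolve, reject) => {
    if (signal?.aborted) return reject(new Error("aborted"));
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, seconds * 1000);
    const onAbort = () => {
      clearTimeout(timer); // stop waiting as soon as the session aborts
      reject(new Error("aborted"));
    };
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```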

Model concurrency model

There are multiple concurrency layers; they intentionally solve different problems.

flowchart TD
UserMessages["User/session messages"] --> SessionQueue["SessionMessageQueue"]
SessionQueue -->|serial in one session| MainTurn["main agent model turn"]
MainTurn --> Tools["tool calls"]
Tools --> TaskTool["task tool / subagents"]
TaskTool --> Limiter["SubagentConcurrencyLimiter"]
Limiter --> Agents["parallel subagent sessions"]
Agents --> ModelCalls["parallel model API calls"]
MainTurn --> WebSocket["WebSocket Responses adapter"]
WebSocket -->|one active loop| WsGuard["no concurrent reuse"]
WebSocket --> Nested["nested prompt client for nested calls"]
BackgroundSessions["background sessions"] --> ModelCalls
MCPTasks["MCP task streams"] --> Agents

Within one session

Normal user messages are queued and processed serially. If a turn is already active, additional user messages are appended to itemQueue; immediate system notifications are stored separately and injected at the next request boundary. The session also defers session.idle while active background agents are running.

This means a single foreground session does not normally issue overlapping main-agent model turns. It can still create concurrent model traffic through background subagents and external task mechanisms.
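
The serial behavior can be illustrated with a minimal queue: one turn drains messages one at a time, and messages arriving mid-turn simply wait. The class and field names echo the analysis aliases, but the implementation is a sketch, not the bundle's code.

```javascript
// Sketch of serial message processing: if a turn is active, new messages
// wait in itemQueue until the current turn finishes.
class SessionMessageQueue {
  constructor(processTurn) {
    this.processTurn = processTurn; // async handler for one message
    this.itemQueue = [];
    this.active = false;
  }
  enqueue(message) {
    this.itemQueue.push(message);
    if (!this.active) this.drain(); // fire and forget
  }
  async drain() {
    this.active = true;
    while (this.itemQueue.length > 0) {
      await this.processTurn(this.itemQueue.shift()); // one turn at a time
    }
    this.active = false; // a real session would consider session.idle here
  }
}
```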

Subagents and fleet mode

Subagent concurrency is controlled by a limiter:

| Account plan | Observed default concurrent subagent slots |
| --- | --- |
| Free / Edu | 2 |
| Pro / Pro Plus | 4 |
| Max | 8 |
| Business | 16 |
| Enterprise / default fallback | 32 |

COPILOT_SUBAGENT_MAX_CONCURRENT can override the limit, clamped to the range 1..256. A failed acquire returns a user-facing error such as “Maximum concurrent agent limit … reached.” Slots are released when an agent completes, fails, is cancelled, or enters an idle multi-turn wait. When an idle multi-turn agent receives a new message, it reacquires a slot before resuming.
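
A slot-based limiter with these plan defaults and the env clamp might look like the following sketch. The names, the plan map shape, and the tryAcquire/release API are assumptions; only the default values and the 1..256 clamp come from the observed behavior.

```javascript
// Illustrative limiter sketch; plan keys and class shape are hypothetical.
const PLAN_DEFAULT_SLOTS = { free: 2, pro: 4, max: 8, business: 16, enterprise: 32 };

function resolveSubagentLimit(plan, env = process.env) {
  const override = Number(env.COPILOT_SUBAGENT_MAX_CONCURRENT);
  if (Number.isInteger(override)) {
    return Math.min(256, Math.max(1, override)); // clamp to 1..256
  }
  return PLAN_DEFAULT_SLOTS[plan] ?? 32; // enterprise-style fallback
}

class SubagentConcurrencyLimiter {
  constructor(limit) {
    this.limit = limit;
    this.inUse = 0;
  }
  tryAcquire() {
    // A failed acquire is where the user-facing "limit reached" error surfaces.
    if (this.inUse >= this.limit) return false;
    this.inUse += 1;
    return true;
  }
  release() {
    // Called on completion, failure, cancellation, or idle multi-turn wait;
    // a resuming idle agent must tryAcquire() again.
    this.inUse = Math.max(0, this.inUse - 1);
  }
}
```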

Fleet mode does not introduce a separate model scheduler. It uses the same task tool and TaskRegistry, so fleet parallelism is bounded by the same subagent limiter and by upstream model/account rate limits.

WebSocket Responses concurrency

The WebSocket Responses adapter has a stricter local guard: one adapter instance cannot be reused for overlapping tool loops. If a nested prompt/model call is needed, it uses a nested HTTP Responses client rather than reusing the same WebSocket client. This avoids corrupting previous_response_id, incremental input state, and connection ownership.
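
The guard reduces to a per-adapter in-use flag, sketched below with hypothetical names; the real adapter also manages connection ownership and previous_response_id state that this sketch omits.

```javascript
// Minimal reuse guard: one tool loop owns the adapter at a time; overlapping
// callers are rejected rather than silently sharing connection state.
class WebSocketAdapterGuard {
  constructor() {
    this.activeLoop = false;
  }
  async runToolLoop(fn) {
    if (this.activeLoop) {
      throw new Error("WebSocket Responses adapter cannot be reused concurrently");
    }
    this.activeLoop = true;
    try {
      return await fn(); // the loop owns the connection until it settles
    } finally {
      this.activeLoop = false; // sequential reuse is fine
    }
  }
}
```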

Generic GitHub API retry and circuit breaker

Some non-model GitHub API calls use a separate fetch wrapper with retry and circuit breaker behavior.

stateDiagram-v2
[*] --> Closed
Closed --> Closed: success or non-counted error
Closed --> Open: repeated network/5xx failures
Open --> Open: reject requests immediately
Open --> HalfOpen: reset timeout elapsed
HalfOpen --> Closed: probe succeeds
HalfOpen --> Open: probe fails

Observed behavior:

  • retryable statuses include 429, 500, and 502 for the generic GitHub fetch wrapper;
  • circuit-breaker-counted response statuses include 500, 502, 503, and 504;
  • network/connect/timeout-style errors are counted by the circuit breaker;
  • the default retry count is 5, default delay is 5 seconds, and backoff factor is 2;
  • the circuit opens after repeated failures and rejects requests until its reset window elapses;
  • if an HTTP response does not look like it originated from GitHub and lacks expected tracing headers, it is classified as a possible proxy/firewall interception.
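
The state machine above can be sketched as a small class. The threshold and reset-timeout values here are illustrative parameters, not the bundle's observed constants, and the injectable clock exists purely for testing.

```javascript
// Closed/open/half-open breaker sketch; names and defaults are illustrative.
class HostCircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.failures = 0;
    this.state = "closed";
    this.openedAt = 0;
  }
  allowRequest() {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetTimeoutMs) return false; // reject fast
      this.state = "half-open"; // reset window elapsed: allow one probe
    }
    return true;
  }
  recordSuccess() {
    this.failures = 0;
    this.state = "closed";
  }
  recordFailure() {
    // Counted failures: network/connect/timeout errors and 5xx statuses.
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```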

This wrapper is separate from the model request loop. Do not assume every model API call goes through the generic GitHub fetch wrapper; the model adapters call SDK clients directly and then apply the model retry loop above.

Other recovery paths worth knowing

| Scenario | Runtime behavior |
| --- | --- |
| 401 session token expiry | The model loop can call a session-token refresh callback once, update the client token, and retry immediately. If refresh fails for auto mode, the session clears auto-mode state and asks the user to resend. |
| 402 quota/billing | No retry. The session emits session.error with errorType: "quota" and maps known codes such as quota_exceeded, session_quota_exceeded, and billing_not_configured to user-facing guidance. |
| 400 transient bad request | The model loop retries a small number of times with short backoff before treating it as a query error. |
| 413 request too large | The model loop removes images/native document attachments from older messages, emits binary_attachments_removed, and retries. |
| Context-window overflow | The session emits session.error with errorType: "context_limit" and suggests starting a new session or compacting. |
| HTTP/2 GOAWAY / undici pool assertion | The loop resets the global dispatcher, wraps the problem as a retryable transport error, and retries. |
| WebSocket Responses early 429/500/503 | If no meaningful streaming output has arrived, the WebSocket adapter disables that attempt, closes the connection, and falls back to HTTP Responses for the remaining retry path. |
| Missing Responses completion event | If accumulated output is substantive, the Responses adapter uses the snapshot as a fallback; otherwise it throws so the outer retry path can retry. |
| Non-API streaming error | A request processor retries after a short delay, except for abort and timeout errors. |
| User/session abort | Abort signals are passed into model requests, WebSocket waits, tool execution, and rate-limit pause timers. Aborts are not retried as ordinary failures. |
| Parent subagent cancellation | TaskRegistry recursively cancels child agents, aborts their controllers, resolves waiters, and marks tasks as cancelled. |
| Tool execution failure | Tool failures are converted into tool-result messages for the model when possible, with telemetry/error fields preserved. |
| Invalid tool-call JSON | Tool arguments can be cleared and returned to the model as a structured failure instead of crashing the runtime. |
| Copilot content annotations | A processor can block tool execution and retry; after max retries it soft-fails and continues. |
| Git command errors | Git errors are classified by type; known expected cases can set skipReport to avoid noisy crash reporting. |

Usage and quota warnings

The model success path reads quota/usage snapshot headers with prefixes such as x-quota-snapshot- and x-usage-ratelimit-. The TUI warning layer watches assistant usage events and emits ephemeral usage-limit warnings at these remaining/used thresholds:

  • over 50 percent used;
  • over 75 percent used;
  • over 90 percent used;
  • over 95 percent used.

The warning layer tracks which thresholds were already emitted so users are not repeatedly spammed for the same limit crossing. Debug env vars can force remaining percentages for testing, but the real source is the provider response headers.
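
A minimal version of that dedup logic, with hypothetical names and event shape: track the highest threshold already warned per limit and emit only when a new, higher threshold is crossed.

```javascript
// Threshold-crossing dedup sketch; thresholds match the list above,
// everything else (names, event shape) is an assumption.
const USAGE_THRESHOLDS = [50, 75, 90, 95];

class UsageLimitWarningEmitter {
  constructor(emit) {
    this.emit = emit; // callback receiving { limit, threshold }
    this.emitted = new Map(); // limit name -> highest threshold already warned
  }
  onUsageSnapshot(limit, percentUsed) {
    // Highest threshold this snapshot has crossed, if any.
    const crossed = USAGE_THRESHOLDS.filter((t) => percentUsed > t).pop();
    if (crossed === undefined) return;
    const last = this.emitted.get(limit) ?? 0;
    if (crossed > last) {
      this.emitted.set(limit, crossed);
      this.emit({ limit, threshold: crossed }); // warn once per band per limit
    }
  }
}
```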

Practical takeaways

  • A burst of parallel subagents can multiply model traffic; the local subagent limiter only caps local parallelism, not upstream account limits.
  • For a single foreground chat, model requests are mostly serial because the session queue processes one active turn at a time.
  • 429s are not immediately fatal. The CLI first retries with backoff, then may switch to auto mode or pause pending messages.
  • A custom/BYOK provider can still return 429, but the auto-mode fallback is GitHub Copilot-specific and is not used for custom providers.
  • Retry-After values longer than the model retry policy maximum are treated as too long for the immediate retry loop; the session-level pause may still use retry-after for queued-message recovery.
  • WebSocket Responses is opportunistic. It can improve streaming behavior, but early WebSocket transport failures fall back to HTTP Responses.
  • 402 quota errors and context-window errors are classified, user-facing terminal states, not retry loops.

Created and maintained by Yingting Huang.