Why context windows are finite
description
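Context windows are finite because the cost of attention in a transformer grows with sequence length. In standard (dense) self-attention every token attends to every other token, so attention compute grows roughly quadratically with the number of tokens, while the key/value cache held in memory grows linearly. Serving very long contexts therefore costs more compute, memory, and latency per request.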
how it works
A larger context window is not free: it increases the cost and time to process a request, which is part of why providers publish a fixed maximum window size per model rather than an unbounded one.
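To make the scaling concrete, here is a rough back-of-the-envelope sketch. It assumes standard dense multi-head attention with no optimizations (no sparse or sliding-window attention, no grouped-query attention), counts only the attention matmuls rather than the full forward pass, and uses a hypothetical model shape (`d_model=4096`, `n_layers=32`, 2-byte values) chosen purely for illustration. The function names are made up for this example and do not describe any particular provider's model.

```python
def attention_flops(seq_len: int, d_model: int, n_layers: int) -> int:
    """Approximate FLOPs for the two attention matmuls over a full sequence.

    Assumption: dense self-attention. Computing the scores (Q @ K^T) and
    the weighted sum (scores @ V) each cost about 2 * seq_len^2 * d_model
    FLOPs per layer, summed over all heads. Causal masking and projection
    layers are ignored, so this is an estimate, not an exact count.
    """
    return 4 * seq_len**2 * d_model * n_layers


def kv_cache_bytes(seq_len: int, d_model: int, n_layers: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate key/value cache size for one sequence.

    Assumption: full multi-head attention, so each layer stores one key
    vector and one value vector of width d_model per token.
    """
    return 2 * seq_len * d_model * n_layers * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    d_model, n_layers = 4096, 32
    for seq_len in (8_192, 16_384, 32_768):
        flops = attention_flops(seq_len, d_model, n_layers)
        cache_gib = kv_cache_bytes(seq_len, d_model, n_layers) / 1024**3
        print(f"{seq_len:>6} tokens: ~{flops:.2e} attention FLOPs, "
              f"~{cache_gib:.0f} GiB KV cache")
```

Under these assumptions, doubling the context length quadruples the attention compute and doubles the cache memory, which is the kind of growth that makes an unbounded window impractical to serve.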
source note
General, model-agnostic architectural reasoning; not sourced to a specific document mirrored in this repo.
provenance
created 2026-07-02 08:26:58 · JSON