7 comments

  • coppsilgold 9 hours ago
    There are also interesting approaches to more directly compress a large document or an entire codebase into a smaller set of tokens without getting the LLM to wing it. For example, Cartridges: <https://hazyresearch.stanford.edu/blog/2025-06-08-cartridges>

    They basically get gradient descent to optimize the KV cache while freezing the network.
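    A toy sketch of that idea (purely illustrative, not the paper's method: names are made up, the "model" is one frozen matrix, and the gradient is finite differences rather than real backprop through a transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 4                                # toy hidden size, trainable slots
W = rng.normal(size=(d, d)) / np.sqrt(d)   # frozen "model" weights

def readout(q, kv):
    # one softmax-attention readout over the trainable slots;
    # keys and values share a matrix here to keep the toy tiny
    s = kv @ (W @ q)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ kv

queries = rng.normal(size=(16, d))
targets = rng.normal(size=(16, d)) * 0.1   # what the frozen model should reproduce

def loss(kv):
    return float(np.mean([(readout(q, kv) - t) ** 2
                          for q, t in zip(queries, targets)]))

kv = 0.1 * rng.normal(size=(m, d))  # the "cartridge": the only thing we train
eps, lr = 1e-4, 0.1
start = loss(kv)
for _ in range(100):
    # numerical gradient: fine for a toy; the paper backprops through the network
    g = np.zeros_like(kv)
    for i in range(m):
        for j in range(d):
            up, dn = kv.copy(), kv.copy()
            up[i, j] += eps
            dn[i, j] -= eps
            g[i, j] = (loss(up) - loss(dn)) / (2 * eps)
    kv -= lr * g
print(start, loss(kv))  # loss drops while W never changes
```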

  • az09mugen 11 hours ago
    Unrelated, but 69KB is how much RAM Voyager 1 has.
    • gregman1 10 hours ago
      Voyager as a token of curiosity
  • jasonjmcghee 3 hours ago
    > OpenAI applies it automatically and charges 50% less for cache hits

    This is incorrect. It's 90% cheaper.

    https://developers.openai.com/api/docs/pricing

  • LuxBennu 10 hours ago
    good overview of the architecture side, but worth mentioning there's another axis that stacks on top of all of this: you can quantize the kv cache itself at inference time. in llama.cpp you can run q8 for keys and q4 for values, and it cuts cache memory roughly in half again on top of whatever gqa or mla already saves you.

    i run qwen 70b 4-bit on an m2 max 96gb, and the kv quant is what actually made longer contexts fit without running out of unified memory. keys need more precision because they drive attention scores, but values are way more tolerant of lossy compression, so the asymmetry works out.
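    for concreteness, a sketch of the flags (model path and context size are placeholders; flag names as of recent llama.cpp builds, so check your version):

```shell
# --cache-type-k/-v set the KV cache quantization: q8_0 keys, q4_0 values.
# quantizing the V cache requires flash attention to be enabled in llama.cpp.
./llama-server -m model.gguf -c 32768 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q4_0
```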
    • suprjami 9 hours ago
      Some models really suffer badly from KV quantisation. You can also take a speed hit using dissimilar K and V types.

      TurboQuant seems to be the next big thing in context memory usage. Polar coordinates achieving ~5x reduction in memory usage with minimal/no quality loss, and even a slight speedup in some cases.

  • refulgentis 10 hours ago
    Good prose, but it keeps collapsing distinct layers of the stack into one poetic notion of “memory.” KV cache, prompt caching, product-level saved memory, transcript storage, retrieval, summarization, and long-context degradation are different mechanisms with different failure modes. Once those boundaries disappear, you get lines like “API pricing is the price of remembering.” Evocative, sure. Explanatory, not really.

    Same thing in the technical bits.

    “Computation drops from quadratic to linear” is only narrowly true for incremental decoding after the prefix is already processed.
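    To make the per-step vs total distinction concrete, a toy operation count (pure illustration, no real model):

```python
# cost of decoding token t over a prefix of length t:
#   with a KV cache: the new query attends over t cached keys  -> O(t) per step
#   without:         recompute all t queries against t keys    -> O(t^2) per step
def step_with_cache(t):
    return t

def step_without_cache(t):
    return t * t

n = 1000
with_cache = sum(step_with_cache(t) for t in range(1, n + 1))        # ~n^2/2 total
without_cache = sum(step_without_cache(t) for t in range(1, n + 1))  # ~n^3/3 total
print(with_cache, without_cache)  # 500500 333833500
```

    And the prefill of the initial prompt is still quadratic either way; the cache only makes each subsequent decoding step linear.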

    “When the KV cache gets too large, the standard solution is compaction” is worse: the standard responses are boring systems tricks like limits, eviction, paging/offload, compression, etc. Summarization is usually an application workaround where you throw away old text and replace it with a shorter prompt. The cache never became a summary; the prompt did.
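    For example, the boring version of eviction, a sink-plus-sliding-window policy (a StreamingLLM-style toy, not any particular runtime's implementation):

```python
from collections import deque

class KVWindow:
    # keep the first `n_sink` entries forever plus the most recent `window`
    # entries; everything in between is evicted. integers stand in for real
    # per-token KV entries.
    def __init__(self, n_sink=4, window=8):
        self.sink = []
        self.n_sink = n_sink
        self.recent = deque(maxlen=window)

    def append(self, entry):
        if len(self.sink) < self.n_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)  # deque drops the oldest automatically

    def entries(self):
        return self.sink + list(self.recent)

cache = KVWindow(n_sink=2, window=3)
for t in range(10):
    cache.append(t)
print(cache.entries())  # [0, 1, 7, 8, 9]
```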

    So I wouldn’t call the piece wrong so much as aggressively smooth. It knows the vocabulary, but it keeps letting metaphor outrun mechanism.
