Three types of LLM workloads and how to serve them

(modal.com)

61 points | by charles_irl 14 hours ago

2 comments

rippeltippel 8 hours ago
> Gallia est omnis divisor in partes tres.
OCD-driven fix: The correct Latin quote is "Gallia est omnis divisa in partes tres".
- charles_irl 3 hours ago
  oof ty, willfix
ZsoltT 4 hours ago
> we recommend using SGLang with excess tensor parallelism and EAGLE-3 speculative decoding on live edge Hopper/Blackwell GPUs accessed via low-overhead, prefix-aware HTTP proxies
lord
- charles_irl 2 hours ago
  Sorry to lead with a bunch of jargon! Wanted to make it obvious that we'd give concrete recommendations instead of palaver.
  The technical terms there are later explained and diagrammed, and the recommendations derived from something close to first principles (e.g. roofline analysis).