1 comment

  • kinderpingui 1 hour ago
    I wrote this to explain how MoE gets you 64x more parameters at only ~2x the compute cost. The key insight is sparse activation: each token is routed to only 2 of the 8 experts in the running example, so per-token compute grows with the number of active experts, while the parameter count grows with the total number of experts, all of which stay loaded in memory.
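
    To make the sparse-activation point concrete, here is a minimal toy sketch of a top-2 MoE layer. It's my own simplified version, not the code from the post, and the dimensions and names are made up: all 8 expert weight matrices sit in memory, but each token's forward pass only runs through the 2 experts the gate picks.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoE(nn.Module):
        """Toy top-2 MoE layer: all experts stay loaded, only 2 run per token."""
        def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList(
                [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                 for _ in range(n_experts)]
            )
            self.gate = nn.Linear(d_model, n_experts, bias=False)
            self.top_k = top_k

        def forward(self, x):                          # x: (n_tokens, d_model)
            scores = self.gate(x)                      # (n_tokens, n_experts) router logits
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)       # mixing weights over the chosen experts
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):  # only tokens routed to expert e pass through it
                token_ids, slot = (idx == e).nonzero(as_tuple=True)
                if token_ids.numel():
                    out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
            return out

    moe = SparseMoE()
    y = moe(torch.randn(10, 64))  # each of the 10 tokens touched exactly 2 of the 8 experts
    ```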

    The post includes working PyTorch code for the routing mechanism and shows how the experts naturally specialize (one ends up handling punctuation, another numbers, etc.) without any explicit supervision.
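
    I haven't reproduced the post's analysis here, but one rough, hypothetical way to eyeball the specialization effect with the toy layer sketched above is to tally which experts each token gets routed to and look at what each expert keeps receiving. routing_histogram and embed are placeholder names of mine, not anything from the article.

    ```python
    from collections import Counter

    def routing_histogram(moe, token_strings, embed):
        """embed: any callable mapping a token string to a (d_model,) tensor."""
        counts = {e: Counter() for e in range(len(moe.experts))}
        for tok in token_strings:
            scores = moe.gate(embed(tok).unsqueeze(0))        # (1, n_experts) router logits
            top2 = scores.topk(2, dim=-1).indices.squeeze(0)  # the 2 experts this token hits
            for e in top2.tolist():
                counts[e][tok] += 1
        return counts  # per expert: the tokens it keeps seeing (punctuation, digits, ...)
    ```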

    Models like Mixtral-8x7B (~47B total params, ~13B active per token, so it runs roughly like a 13B dense model) prove this works at scale. Happy to answer questions about the implementation.
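
    If you want to sanity-check those numbers, here is the back-of-the-envelope arithmetic using Mixtral's published config values as I read them (hidden size 4096, FFN size 14336, 32 layers, 8 experts, top-2 routing, 32k vocab). It's a rough sketch that ignores layer norms and router weights, not exact accounting.

    ```python
    d_model, d_ff, n_layers, n_experts, top_k, vocab = 4096, 14336, 32, 8, 2, 32000

    expert = 3 * d_model * d_ff                   # gate/up/down projections per expert
    attn = d_model * (4096 + 1024 + 1024 + 4096)  # q, k, v (grouped-query), o projections
    embed = 2 * vocab * d_model                   # input embeddings + LM head

    total  = n_layers * (n_experts * expert + attn) + embed  # everything kept in memory
    active = n_layers * (top_k * expert + attn) + embed      # what a single token actually uses
    print(f"total = {total/1e9:.1f}B, active per token = {active/1e9:.1f}B")  # ~46.7B vs ~12.9B
    ```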