R-Zero: Self-Evolving Reasoning LLM from Zero Data

(arxiv.org)

39 points | by lawrenceyan 8 hours ago

4 comments

nakamoto_damacy 4 minutes ago
Perpetual Motion Machines were a thing at some point, too.
jasonjmcghee 5 hours ago
Conceptually, it's effectively a GAN
- torginus 7 minutes ago
  GANs are a supervised training method and not really self-improving once they have converged to reproducing the training set.
- magicalhippo 36 minutes ago
  For those not in the know, GAN stands for Generative Adversarial Network[1], a setup in which two neural networks are trained competitively.
  One network typically generates tasks for the other, and is rewarded if it manages to make the other network fail the task. The other network is rewarded if it successfully completes the task.
  Thus the adversarial network tries to find weaknesses to exploit, and the combined training makes the solving network much stronger. Or at least that's the idea.
  [1]: https://en.wikipedia.org/wiki/Generative_adversarial_network
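The reward scheme magicalhippo describes can be sketched as a toy loop. This is only an illustration of the incentive structure, not an actual GAN and not the method from the paper: the integer `difficulty` and `skill` values, the update rules, and every function name here are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rewards:
    challenger: float
    solver: float


def adversarial_rewards(solver_succeeded: bool) -> Rewards:
    """The task generator is rewarded when the solver fails; the solver when it succeeds."""
    if solver_succeeded:
        return Rewards(challenger=0.0, solver=1.0)
    return Rewards(challenger=1.0, solver=0.0)


def play_round(difficulty: int, skill: int) -> Tuple[int, int, Rewards]:
    """One round of the toy game (hypothetical update rules)."""
    solved = skill >= difficulty
    rewards = adversarial_rewards(solved)
    if solved:
        difficulty += 1  # challenger searches for a harder task
    else:
        skill += 1  # solver improves on the task it failed
    return difficulty, skill, rewards


def run(rounds: int, difficulty: int = 1, skill: int = 0) -> List[Tuple[int, int, Rewards]]:
    """Play several rounds and record (difficulty, skill, rewards) after each."""
    history = []
    for _ in range(rounds):
        difficulty, skill, rewards = play_round(difficulty, skill)
        history.append((difficulty, skill, rewards))
    return history


if __name__ == "__main__":
    for difficulty, skill, rewards in run(4):
        print(difficulty, skill, rewards)
```

In this toy, the two sides ratchet each other upward: each failure raises `skill`, each success raises `difficulty`. Whether real training behaves this nicely is exactly what the thread is arguing about.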
thom 2 hours ago
For values of zero quite far above zero.
- falcor84 1 hour ago
  What am I missing? From my skimming, there's zero external data beyond what is needed for the Challenger to generate questions.
  - thom 31 minutes ago
    An existing trained LLM is an enormous amount of 'data', however it might be encoded. AlphaZero didn't start with Stockfish or a database of games.
    - magicalhippo 24 minutes ago
      As I understand it, the point of the article isn't to train an LLM from scratch; it's to teach a non-reasoning model to reason without additional explicit training data.
    - tucnak 16 minutes ago
      AlphaZero is oftentimes dragged out to ridicule the so-called "self-play LLM training" techniques, although I don't think these arguments are terribly convincing. You can think of AlphaZero games as effectively synthetic data in an adversarial setting; yes, that data is easy to produce and verify because the rules of chess are verifiable, so on paper it doesn't require much external data. This is not the case for most texts, with some notable exceptions in verifiable domains, where self-play is, not coincidentally, applied most successfully. Thus, you could make an argument that the pre-existing "trained LLM" is merely functioning as a verifier proxy, analogous to the well-defined chess verifier in AlphaZero.
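To make tucnak's distinction concrete, here is a minimal sketch contrasting a rule-based verifier with a "verifier proxy". Everything in it is an assumption for illustration: the `"a+b"` question format, the function names, and the `judge` callable, which stands in for a pretrained model scoring an answer and is not any real library's API.

```python
from typing import Callable

# A verifier maps (question, answer) to a reward in [0, 1].
Verifier = Callable[[str, str], float]


def exact_arithmetic_verifier(question: str, answer: str) -> float:
    """Rule-based check for hypothetical 'a+b' questions.

    Like the rules of chess, the check needs no labeled data.
    """
    try:
        left, right = question.split("+")
        return 1.0 if int(answer) == int(left) + int(right) else 0.0
    except ValueError:
        # Malformed question or non-numeric answer earns no reward.
        return 0.0


def proxy_verifier(judge: Callable[[str, str], float]) -> Verifier:
    """Wrap a hypothetical model-based judge so it can be used as a verifier.

    The judge's score is clamped to [0, 1]; its knowledge comes from pretraining.
    """
    def verify(question: str, answer: str) -> float:
        return min(1.0, max(0.0, judge(question, answer)))
    return verify


if __name__ == "__main__":
    print(exact_arithmetic_verifier("2+3", "5"))
    # Stand-in judge for the demo; a real one would be a trained model.
    lenient_judge = proxy_verifier(lambda q, a: 0.8 if a.strip() else 0.0)
    print(lenient_judge("Why is the sky blue?", "Rayleigh scattering"))
```

The first function is exact because the domain is verifiable; the second is only as good as the judge behind it, which is where thom's objection about "zero data" comes in.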
cyberge99 5 hours ago
What could go wrong?
- magicalhippo 39 minutes ago
  Just don't hook it into the nuclear missile controls. We've seen[1] how that goes[2].
  [1]: https://en.wikipedia.org/wiki/Colossus:_The_Forbin_Project
  [2]: https://en.wikipedia.org/wiki/The_Terminator