Selection rather than prediction

(voratiq.com)

36 points | by languid-photic 4 days ago

8 comments

  • DoctorOetker 16 minutes ago
    I have the impression the implied conclusion is that, in the situation described, it would be better to consult several different LLM models than a single one, but that is not what they demonstrate:

    to demonstrate that, you would have to measure the compute/cost of running the models and human-verifying the output.

    the statistics provided don't at all exclude the possibility that, instead of giving the top 5 models a single opportunity each to propose a solution, it may be more efficient to give all 5 opportunities to the best-scoring model:

    at a 24% win rate the null hypothesis (what a usual researcher ought to predict based on common sense) would be that the probability of a loss is 76%, that the probability of losing all N attempts is (0.76 ^ N), and so that the probability of at least one win in N attempts is ( 1 - (0.76 ^ N) ).

    So for consulting the best scoring model twice (2 x top-1) I would expect 42.24%, better than giving the 2 top scoring models each a single try (1 x top-2), which resulted in 35%

    Same for 3x top-1 vs 1x top-3: 56.10% vs 51%

    Same for 4x top-1 vs 1x top-4: 66.64% vs 66%

    Same for 5x top-1 vs 1x top-5: 74.64% vs 73%

    Same for 6x top-1 vs 1x top-6: 80.73% vs 83%

    Same for 7x top-1 vs 1x top-7: 85.35% vs 90%

    Same for 8x top-1 vs 1x top-8: 88.87% vs 95%

    I can't read the numerical error bars on the top-1 model win rate; with those we could calculate a likelihood to see whether the deviation is statistically significant.
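
    A quick check of that arithmetic in Python, as a sketch assuming independent attempts at a fixed 24% per-attempt win rate (the ensemble numbers are the ones quoted above):

        # Repeated draws from the single best model vs. the quoted top-N ensembles,
        # assuming each attempt independently loses with probability 0.76.
        p_loss = 0.76

        # Ensemble win rates quoted above for top-2 .. top-8, in percent.
        ensemble = {2: 35, 3: 51, 4: 66, 5: 73, 6: 83, 7: 90, 8: 95}

        for n, ens in ensemble.items():
            repeated = (1 - p_loss ** n) * 100  # P(at least one win in n tries)
            print(f"{n}x top-1: {repeated:5.2f}%   vs   1x top-{n}: {ens}%")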

  • majormajor 5 hours ago
    > When you commit to a single agent, you're predicting it will be best for whatever task you throw at it.

    A quibble with this: you're not predicting it will be the best for whatever task you're throwing at it; you're predicting it will be sufficient.

    For well-understood problems you can get adequate results out of a lot of models these days. Having to review n different outputs sounds like a step backwards for most tasks.

    I do this sort of thing at the planning stage, though. Especially because there's not necessarily an obvious single "right" answer for a lot of questions, like how to break down a domain, or approaches to coordinating multiple processes. So if three different models suggest three different approaches, it helps me refine what I'm actually looking for in the solution. And that increases my hit rate for my "most models will do something sufficient" claim above.

    • languid-photic 2 hours ago
      This is a good point!

      We still code via interactive sessions with single agents when the stakes are lower (simple things, one-off scripts, etc.). But for more important stuff, we generally want the highest quality solution possible.

      We also use this framework for brainstorming and planning. E.g. sometimes we ask them to write design docs, then compare and contrast. Or intentionally under-specify a task, see what the agents do, and use that to refine the spec before launching the real run.

  • chr15m 4 hours ago
    If you view LLM driven dev as a kind of evolutionary process rather than an engineering process (at the level of a single LLM output) then this makes a lot of sense. You're widening the population from which you select for fitness.
    • languid-photic 2 hours ago
      This was exactly the kernel of the idea :)
      • chr15m 1 hour ago
        Ah interesting. Thank you very much for sharing the illuminating results.

        One question I had - was the judgement blinded? Did judges know which models produced which output?

        • languid-photic 1 hour ago
          It was not; the agent id is not overt, but it can be found via the workspace filepath.

          But that is a good point. Perhaps it should be mapped to something unidentifiable.

  • bisonbear 10 hours ago
    Intuitively makes sense, but in my experience, a more realistic workflow is a main-agent-to-sub-agent delegation pattern instead of straight 7x-ing token costs.

    By delegating to sub-agents (e.g. for brainstorming or review), you can break out of local maxima while not using quite as many more tokens.

    Additionally, when doing any sort of complex task, I do research -> plan -> implement -> review, clearing context after each stage. In that case, would I want to make 7x research docs, 7x plans, etc.? Probably not. Instead, a more prudent use of tokens might be to have Claude do research+planning, and have Codex do a review of that plan prior to implementation.
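
    A minimal sketch of that staged flow, with a hypothetical run_agent helper standing in for whatever CLI or API actually drives the agents (the prompts and stage wiring are illustrative only):

        # Hypothetical staged pipeline: each call starts from a clean context and
        # passes only the previous stage's written artifact forward.
        def run_agent(model: str, prompt: str) -> str:
            # Placeholder: swap in a real agent invocation (CLI wrapper, API client, ...).
            return f"[{model} output for: {prompt.splitlines()[0]}]"

        task = "Add rate limiting to the public API"

        research = run_agent("claude", f"Research the codebase for: {task}")
        plan = run_agent("claude", f"Write an implementation plan based on:\n\n{research}")
        critique = run_agent("codex", f"Review this plan before implementation:\n\n{plan}")
        diff = run_agent("claude", f"Implement the plan, addressing the review:\n\n{plan}\n\n{critique}")
        final = run_agent("codex", f"Review this diff against the plan:\n\n{plan}\n\n{diff}")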

    • languid-photic 1 hour ago
      Yes, understandable.

      The question is which multi-agent architecture, hierarchical or competitive, yields the best results under some task/time/cost constraints.

      In general, our sense is that competitive is better when you want breadth and uncorrelated solutions. Or when the failure modes across agents are unknown (which is always, right now, but may not be true forever).

    • girvo 9 hours ago
      > straight 7x-ing token costs

      You are probably right, but my work pays for as many tokens as I want, which opens up a bunch of tactics that otherwise would be untenable.

      I stick with sub-agent approaches outside of work for this reason though, which is a more than fair point.

    • darkerside 4 hours ago
      Maybe an evolution-based approach does make sense: 3x instead, and over time drop the least effective agents, replacing them with others, even chosen at random (sketch below).

      Edit: And this is why you should read the article before you post!
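
      A rough sketch of that rotation, assuming per-agent wins are tracked over a recent window (the agent names are made up):

          import random

          roster = ["agent-a", "agent-b", "agent-c", "agent-d", "agent-e"]  # all available agents
          pool = roster[:3]                       # the 3 currently active agents
          wins = {a: 0 for a in pool}             # wins in the current evaluation window

          def record_winner(agent: str) -> None:
              wins[agent] += 1

          def rotate() -> None:
              """Drop the least effective active agent and replace it with a random outsider."""
              worst = min(pool, key=wins.get)
              pool.remove(worst)
              del wins[worst]
              newcomer = random.choice([a for a in roster if a not in pool])
              pool.append(newcomer)
              wins[newcomer] = 0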

      • languid-photic 2 hours ago
        Yes indeed, you get a big lift from running just the top few agents.

        We run big ensembles because we are doing a lot of analysis of the system, etc.

  • fph 10 hours ago
    AI is like XML: if it doesn't solve your problem, you are not using enough of it.
    • pixl97 6 hours ago
      Great. You just taught the future AI terminator that AI is like violence
  • jmalicki 10 hours ago
    Is this going to get my Claude subscription cancelled if I run it with a claude backend, given that it's orchestrating the CLI? I am still a little unclear about the boundaries of that.
  • tomtom1337 12 hours ago
    Any suggestions for «orchestrating» this type of experiment?

    And how does one compare the results in a way that is easy to parse? 7 models producing 1 PR each is one way, but it doesn't feel very easy to compare them like that.

    • languid-photic 1 hour ago
      https://github.com/voratiq/voratiq

      For comparison, there's a `review` command that launches a sandboxed agent to review a given run and rank the various implementations. We usually run 1–3 review agents, pull the top 3 diffs, and do manual review from there.

      We're working on better automation for this step right now.
