Show HN: Terminal-Bench-RL: Training long-horizon terminal agents with RL

(github.com)

124 points | by Danau5tin 1 day ago

11 comments

tjungblut 1 day ago
If you are curious, like me, about how the actual reinforcement learning happens: it uses verl [1] underneath. The paper "HybridFlow: A Flexible and Efficient RLHF Framework" [2] explains it really well.
[1] https://github.com/volcengine/verl
[2] https://arxiv.org/abs/2409.19256v2
anorwell 1 day ago
Some of the comments so far seem to be misunderstanding this submission. As I understand it:
1. Custom scaffolding (system prompt and tools) using Qwen3-32B achieved 13.75% on Terminal-Bench. No training was involved.
2. The author has built an RL system, but it has not been used for anything due to cost limitations.
So there's actually no result related to training here. It is well known that the scaffolding used can have a large impact on benchmark outcomes (the Terminal-Bench leaderboard also demonstrates this [1]).
[1] https://www.tbench.ai/leaderboard
- esafak 1 day ago
  It looks like the submission has two aspects that are being conflated.
  1. Tooling for training a terminal agent.
  2. An agent that was _not_ trained with this tooling but prompt engineered. I could not find the author's discussion on this point.
OtherShrezzing 1 day ago
That you've spent in the low thousands (by the looks of it) and managed to beat GPT-4.1 is an amazing insight into the moat of the big AI labs.
rboyd 1 day ago
Great work! There should be a way for entities to crowdfund model training. Can a model like this be partially evaluated during training, so that cost is saved through early stopping?
What are the best papers/resources on SOTA long-horizon RL?
Thanks.
TarasBob 1 day ago
I'm willing to help fund this if the creator is interested. I sent him an email.
enigma101 1 day ago
Did you consider a Kickstarter to overcome the GPU poorness? 30 to 50 should be doable.
bravesoul2 1 day ago
Wow, amazing! Amazing that a "one-person band" can do this much. It crosses many skill sets.
thomasfromcdnjs 1 day ago
How much did you spend?
lostmsu 11 hours ago
Why do you need 50k? Can't you tune using LoRA?
- Danau5tin 11 hours ago
  Exactly my first thought when I realised the cost! Currently LoRA is not supported by rLLM (the team told me they aim to support it in the next release), but it is certainly possible to port to verl directly or to another RL framework. I just did not have the time to port again (I have already done so twice, as other RL frameworks had issues).
erdaltoprak 1 day ago
This is incredible work