> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.
I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?
Or does this simply show that esolangs are hard to reason in by design? A more honest approach would use a "real", but relatively unpopular, language. Make them use CoffeeScript or Ada or PL/I or Odin or that other systems programming language that that very opinionated guy is implementing on top of QBE.
Just look at what kind of problems are in the easy task set (hello world, echo line, count vowels, etc.). With the best being ~10% of total in brainfuck this is 10 out of 20. You can google more solutions to these problems than that.
> I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?
Setting aside whether this benchmark is meaningful or not - the argument you're making is faulty. There are indeed humans who can write complete programs in Brainfuck and these other esolangs. The fact that you personally can't is not logically relevant.
I'm shocked to see how poorly these models, which I find useful day to day, do in solving virtually any of the problems in Unlambda.
Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.
But the model that did the best, Qwen-235B, got virtually every problem wrong.
Mhh... my hunch is that part of this is that all Python keywords are 1 token, I assume. And for those very weird languages, tokenizing might make it harder to reason over those tokens.
Would love to see how the benchmark results change if the esoteric languages are changed a bit to make them have 1-token keywords only.
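To make that concrete, here's a rough sketch of what "changing the language a bit" could look like for brainfuck: same semantics, but each symbol rewritten as a plain word. The keyword names are made up for illustration, and whether each one really comes out as a single token depends on the model's tokenizer, which this doesn't check.

```python
# Hypothetical keyword dialect of Brainfuck. The word choices are an
# assumption for illustration; single-token status is tokenizer-dependent.
BF_TO_WORD = {
    ">": "right",
    "<": "left",
    "+": "inc",
    "-": "dec",
    ".": "out",
    ",": "in",
    "[": "while",
    "]": "end",
}
WORD_TO_BF = {word: symbol for symbol, word in BF_TO_WORD.items()}


def to_keywords(bf_source: str) -> str:
    """Rewrite Brainfuck symbols as space-separated keywords.

    Characters that are not Brainfuck commands (comments) are dropped.
    """
    return " ".join(BF_TO_WORD[ch] for ch in bf_source if ch in BF_TO_WORD)


def to_brainfuck(keyword_source: str) -> str:
    """Translate the keyword dialect back to plain Brainfuck."""
    return "".join(WORD_TO_BF[word] for word in keyword_source.split())
```

You'd prompt the model in the keyword dialect, translate its answer back with `to_brainfuck`, and grade it with the benchmark's existing interpreter, so only the surface syntax changes.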
Reasoning is hard, reasoning about colors while wearing glasses that obfuscate the real colors... even harder... but not the core issue if your brain is not wired correctly to reason.
I suspect the way out of this is to separate knowledge from reason: to train reasoning with zero knowledge and zero language... and then to train language on top of a pre-trained-for-reasoning model.
Esolang vs mainstream paradigm.
Popular vs scarce training data.
So you'd want to control for training data (e.g. brainfuck vs Odin?)
And ideally you'd control by getting it back to 0, i.e. inventing new programming languages with various properties and testing the LLMs on those.
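A minimal sketch of the "invent a new language" idea, assuming a toy stack machine whose keywords are random nonsense words generated from a seed, so no training corpus could contain programs in it. The op set, `make_dialect`, and `run` are all hypothetical, not anything from the benchmark.

```python
import random
import string

# Canonical operations of a toy stack language (an assumption for illustration).
OPS = ["push", "add", "mul", "dup", "swap", "emit"]


def make_dialect(seed: int) -> dict[str, str]:
    """Map each canonical op to a distinct random 5-letter keyword."""
    rng = random.Random(seed)
    names: set[str] = set()
    while len(names) < len(OPS):
        names.add("".join(rng.choice(string.ascii_lowercase) for _ in range(5)))
    return dict(zip(OPS, sorted(names)))


def run(program: str, dialect: dict[str, str]) -> list[int]:
    """Interpret a whitespace-separated program written in the given dialect."""
    to_op = {keyword: op for op, keyword in dialect.items()}
    stack: list[int] = []
    output: list[int] = []
    tokens = program.split()
    i = 0
    while i < len(tokens):
        op = to_op[tokens[i]]
        if op == "push":
            i += 1  # the literal follows the push keyword
            stack.append(int(tokens[i]))
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "dup":
            stack.append(stack[-1])
        elif op == "swap":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == "emit":
            output.append(stack.pop())
        i += 1
    return output
```

Give the model the dialect's keyword table plus a task, then check its program with `run`. A fresh seed gives a language with zero prior examples, while the semantics (and so the difficulty) stay fixed.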
If the LLM has “skills” for that language, it will definitely increase accuracy.