> Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research.
Regardless of the specific performance of this model versus another model, I think it’s good to keep in mind that everyone benefits from this kind of work.
When it comes to how "open" LLMs are, there are basically three categories:
- Open Source: The architecture of the model is available, so you can bring your own data and compute to train a similar model.
- Open Weights: The trained model itself is available for you to use. You can download it and run it on your own hardware without needing to train anything from scratch.
- Open Data: You get access to the data that was used to train the model. With this, you can completely reproduce the training process, which is super helpful if you want to fine-tune the model or see exactly how it was trained.
There are some other nuances around hyperparams, training methodology, etc. but that's basically the 3 main categories.
I would probably refer to category 1 as "Open Architecture". I wouldn't want to give anyone the false impression that category 1 is comparable in the slightest to Open Weights, which is vastly more useful.
The difference between Meta's Llama and OpenAI is akin to the difference between MS Excel installed on your machine and Google Sheets running in the cloud.
Hard disagree and it sounds like you're extrapolating from the traditional software area without considering the nuances.
From the point of view of academia, free transformative products that build on other stuff, transparency, possibility of building even better models in the future and a big etc, there's NO difference between MS Excel and Google Sheets.
OTOH, the Llama models have allowed all of the above and have helped us immensely in both developing new things and understanding this new generation of LLMs... all of which would be impossible with OpenAI.
Open weight means you get both the details of the architecture and a way to freely iterate to build new things.
People have different definitions of what "freely iterate to build new things" means. For me, having a binary only does not prevent me from doing so.
For example, Minecraft was never distributed with source code; it was binary-only from day one. But the modding community would hard disagree with you if you said there was no way to "freely iterate to build new things". In Gen Z terms, probably: "skill issue" :p
I wonder if I'll ever give a damn about AI LLM and the like.
Maybe that's the generational barrier.
I find that topic so uninteresting, like cars or kitchen apparatuses, or home electronics of the 70s.
I see the technology as energy burning, or rather wasting, and not beneficial to human evolution and survival; quite the contrary.
It's not because of the unknown; it will also replace me and remove the joy of building something on my own.
But I bet in 20 to 30 years it will be like DJing back in the 90s versus DJing now.
DJing back then was manual work and art, required skill.
DJing now is mostly effortless and could even be automated, with AI too. It's more of a performance show than mixing skill and art.
Creating something new will be a matter of just defining what you'd like your result to be (as already is the case very often) and refining the steps.
Instead of writing code, you'll be writing or speaking with an AI, which will then generate code.
When I started coding at the age of 11, that was the dream.
But I still can't find the motivation to deal with AI.
I’m 53 and use LLMs all the time. I admit I think my pre-LLM process was different from many people's. I’d create the first version that barely works, then spend a lot of time refining and refactoring, simplifying things, bit by bit.
I get the sense that many developers only like the first part where you get things working and they don’t like refactoring and refining. LLMs often allow me to get the basic implementation done much quicker, leaving me more time for simplifying, which is the part I enjoy.
I'm 49 now, soon 50! I started coding around the same age as you. Interesting.
I use LLMs quite often, and am impressed with them, in a rather basic [copy problem / understand more of it / paste solution] way. I will probably use them inside IDEs as soon as they are good enough to understand a rather large or complex repo, which today is more hit or miss depending on the files you put in front of them.
I was wondering why Figure 1 showed a HumanEval score of 61.6 for Qwen2.5-Coder-7B, but Table 1 shows a score of 88.4, i.e. better than this new model with a score of 66.5.
The reason is that those are actually two different models (Qwen2.5-Coder-7B-Base with 61.6, Qwen2.5-Coder-7B-Instruct with 88.4).
I was just messing around with LLMs all day, so had a few test cases open. Asked it to change a few things in a ~6KB C# snippet in a somewhat ambiguous, but reasonable way.
GPT-4 did this job perfectly. Qwen:72b did half of the job, completely missed the other half, and renamed one variable that had nothing to do with the question. Llama3.1:70b behaved very similarly to Qwen, which is interesting.
OpenCoder:8b started reasonably well, then randomly replaced "Split('\n')" with "Split(n)" in unrelated code, and then went completely berserk, hallucinating non-existent StackOverflow pages and answers.
My best guess is that you shouldn't train it on mostly code. Natural language conversations used to train other models let them "figure out" human-like reasoning. If your training set is mostly code, it can produce output that looks like code, but it will have little value to humans.
Edit: to be fair, llama3.2:3b also botched the code. But it did not hallucinate complete nonsense at least.
Here is a quite comprehensive leaderboard of LLMs for coding: https://aider.chat/docs/leaderboards/
And they update it quite quickly with new model releases.
> Due to the prevalence of forking and copy-pasting within the codebase, nearly 75% of files are completely duplicated.
This is surprisingly high. Does this include imported libraries and packages? Since you are hashing at the file level, I am not fully convinced that this is due to people copying entire files over without modification.
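For anyone wondering what file-level hashing does and doesn't catch, here is a minimal sketch. It is not the paper's pipeline: `exact_dedup` and the sample repo paths are made up for illustration. It hashes raw file bytes with SHA-256 and keeps the first file per digest, so a vendored file copied into several repos collapses to one entry, while a copy with even a small edit survives.

```python
import hashlib

def exact_dedup(files):
    """Keep the first path for each distinct SHA-256 content digest.

    `files` maps a path to its raw bytes (hypothetical input format).
    """
    seen = {}
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        seen.setdefault(digest, path)  # first path with this content wins
    return sorted(seen.values())

# Made-up sample: the same vendored file in two repos, plus a patched copy.
files = {
    "repo_a/vendor/utils.py": b"def u(s): return s\n",
    "repo_b/vendor/utils.py": b"def u(s): return s\n",
    "repo_c/vendor/utils.py": b"def u(s): return s  # patched\n",
}
kept = exact_dedup(files)
print(f"kept {len(kept)} of {len(files)} files: {kept}")
```

So vendored libraries checked into many repos would count as "completely duplicated" under this kind of scheme even though nobody hand-copied them, which is the scenario the question above is asking about.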
I’ve tried a bunch of different models that are essentially different instruction tuning on base models, and that seems to be generally true in my experience. I don’t think you can fine tune your way into a significantly better code model. At best, one that can follow instructions better, but not one that can usually write noticeably better code or solve harder problems.
Anyone doing training where the metadata of the compilation and execution (like profiling data) is included? Maybe such inclusion could help nudge models in more code-efficient directions? I don't know, I'm only a laygenius at this stuff.
I really would like to work full time on LLM for code generation. I have many ideas on how to leverage the context length to produce way better output than current models. My current setup is Zed editor + ollama + qwen-2.5-coder on an M3 Ultra. Will try this model too but the context length of 8K is already a limiting factor. DM me if you have similar interests.
If you can't compete directly due to US export restrictions on cutting edge chips, then leverage the worldwide non-commercial sector. At least to start with. Facebook took the same strategy when faced with the prospect of training models to compete with those of the companies that had invested billions in infrastructure before them.
remote_code
raise ValueError(
ValueError: Loading infly/OpenCoder-1.5B-Base requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
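That error is what `transformers` raises when a repo ships its own modeling code and you haven't opted in. A minimal sketch of the opt-in, assuming `transformers` and a backend such as PyTorch are installed; `load_opencoder` is just an illustrative wrapper, and the repo id is the one from the error message. Only pass `trust_remote_code=True` after reading the code in that repo, since it will run on your machine.

```python
MODEL_ID = "infly/OpenCoder-1.5B-Base"  # repo id taken from the error message

def load_opencoder(model_id=MODEL_ID, trust_remote_code=False):
    """Load tokenizer and model; illustrative wrapper, not an official API."""
    # Imported lazily so this file can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=trust_remote_code
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=trust_remote_code
    )
    return tokenizer, model

# After reviewing the repo's code:
# tokenizer, model = load_opencoder(trust_remote_code=True)
```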
Looking at the training data. For example a prompt / response like this:
Prompt:
I want to create a Python program that takes a sentence as input and sorts all the words in the sentence in descending order based on their length. How can I achieve this?
Response:
Certainly! You can create a Python program to sort the words in a sentence by their length in descending order. Here’s a step-by-step guide to achieve this:

1. *Split the sentence into words*: Use the `split()` method to break the sentence into a list of words.
2. *Sort the words by length*: Use the `sorted()` function with a custom key to sort the words by their length in descending order.
3. *Join the sorted words*: Use the `join()` method to combine the sorted words back into a single string.

Here's a complete Python program that demonstrates this:

```python
def sort_words_by_length(sentence):
    # Step 1: Split the sentence into words
    words = sentence.split()
    # Step 2: Sort the words by length in descending order
    sorted_words = sorted(words, key=len, reverse=True)
    # Step 3: Join the sorted words back into a single string
    sorted_sentence = ' '.join(sorted_words)
    return sorted_sentence

# Example usage
sentence = "I want to create a Python program"
sorted_sentence = sort_words_by_length(sentence)
print(sorted_sentence)
```

### Explanation:

- *`sentence.split()`*: This method splits the input sentence into a list of words based on whitespace.
- *`sorted(words, key=len, reverse=True)`*: This sorts the list of words based on their length. The `key=len` argument specifies that the length of each word should be used for sorting, and `reverse=True` ensures the sorting is in descending order.
- *`' '.join(sorted_words)`*: This joins the sorted list of words back into a single string, with each word separated by a space.

### Example Output:

For the input sentence `"I want to create a Python program"`, the output will be:

```
create program Python want
```

This output shows the words sorted by length in descending order.
It strikes me that it would be easier to train a NN (or another mechanism) if the training data was more structured and the actual question/answer was stripped of the pleasantries / additional explanations.
Also keep the training data to one language (say English / Python).
I.e.:
Prompt:
sort the words in a sentence by their length in descending order
Response:
sorted(sentence.split(), key=len, reverse=True)
Alternatively, one could use snippets like the above and then synthesize "realistic" prompts / responses.
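To make the comparison concrete, here is the stripped-down form as a runnable sketch (the helper name is borrowed from the quoted response; nothing here comes from the actual dataset tooling). One side effect of keeping pairs this small is that they can be checked by execution: on the quoted input this returns all seven words with "program" first, not the four words listed in the sample's "Example Output".

```python
def sort_words_by_length(sentence: str) -> list[str]:
    # Longest words first; sorted() is stable, so ties keep their original order.
    return sorted(sentence.split(), key=len, reverse=True)

words = sort_words_by_length("I want to create a Python program")
print(" ".join(words))  # program create Python want to I a
```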
You can pay for Claude API access (not normal Claude Pro) and wire in something like Cline via your API key, but it gets expensive fast in my experience.
The company I work for, and actually most Swiss IT contractors, have harsh rules: on more than half of our projects, we aren't allowed to use GitHub Copilot or paste anything into any LLM API.
For that reason I built a vLLM-based local GPU machine for our dev squads as a trial. Currently using a 4070Ti Super with 16GB VRAM and upgrading to 4x 4070Ti Super to support 70b models.
The difficulties we face IMHO:
- Cursor doesn't support WSL Devcontainers
- Small Tab-Complete models are more important, and there's less going on for those
- There's a huge gap between 7-14b and 120b models, not a lot of 70b models available
In reality, on 7-14b nothing beats Qwen2.5 for interactive coding and something around 2b for tab-completion
Question for those using it. Can the 7B really be used locally on a card with only 16GB VRAM? LLM Explorer says[1] it requires 15.4GB. That seems like cutting it close.
I am happily using qwen2.5-coder-7b-instruct-q3_k_m.gguf with a context size of 32768 on an RTX 3060 Mobile with 6GB VRAM using llama.cpp [2]. With 16GB VRAM, you could use qwen2.5-7b-instruct-q8_0.gguf which is basically indistinguishable from the fp16 variant.
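Rough arithmetic helps here: the weights alone take about (parameter count × bits per weight ÷ 8) bytes. A back-of-the-envelope sketch, assuming roughly 7.6B parameters for the "7B" model and approximate bits-per-weight figures for the GGUF quant types (both are my assumptions, not numbers from a model card). It ignores the KV cache and runtime overhead, which grow with context size.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7.6e9  # assumed parameter count for a "7B" model

# Approximate effective bits per weight; assumed values for illustration.
QUANTS = {"fp16": 16.0, "q8_0": 8.5, "q3_k_m": 3.9}

for name, bits in QUANTS.items():
    print(f"{name:>7}: ~{weight_gb(N_PARAMS, bits):.1f} GB")
```

Under those assumptions the fp16 estimate lands close to the 15.4GB figure quoted above, which is why 16GB is tight at full precision while a q8_0 file leaves room for context.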
Bad auto-URL-extraction, presumably. The PDF reads:
> Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models,
arXiv replaces any URL in the text of the abstract with a link with text "this http url"; it seems the authors did not know this and just embedded a bare URL in their abstract.
A more restrictive TLD list would have prevented this, but I certainly don't want to be the one to add new TLDs all the time, so I can see why the code looks like it does.
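A small sketch of the trade-off, with the caveat that this is not arXiv's actual code: `NAIVE`, `KNOWN_TLDS`, and the sample text are all made up for illustration. A pattern that accepts any "word.word" token treats a missing space after a period as a domain, while filtering against a TLD allowlist avoids that at the cost of maintaining the list.

```python
import re

# Hypothetical linkifier: treats any "word.word" token as a bare domain.
NAIVE = re.compile(r"\b[\w-]+\.[A-Za-z]{2,}\b")

# Illustrative allowlist only; the real TLD list is far longer and keeps growing.
KNOWN_TLDS = {"com", "org", "net", "io", "ai", "edu"}

def naive_links(text):
    """Return every token the naive pattern would turn into a link."""
    return NAIVE.findall(text)

def strict_links(text):
    """Keep only matches whose final label is in the allowlist."""
    return [m for m in naive_links(text)
            if m.rsplit(".", 1)[1].lower() in KNOWN_TLDS]

abstract = ("reasoning tasks and agent systems.While open-access code LLMs "
            "are approaching proprietary models; see example.org for details")

print(naive_links(abstract))   # ['systems.While', 'example.org']
print(strict_links(abstract))  # ['example.org']
```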
making a wild guess on the nationality of every author of this paper (1), and observing the number of authors, and observing the velocity and volume of similar papers.. it seems a pattern of "English language as a service to automated programming environments" appears to be very useful and relevant for people (nations?) that are wholly and firmly not English speaking..
(1) is M-A-P or INFtech dot ai a well-known institutional affiliation?
To be clear: INFTech is a for-profit (I think…?) firm out of Shanghai, and MAP is an international FOSS collective (https://m-a-p.ai/about).
Speaking generally, a lot of software engineering worldwide is done in English, so it makes sense that they’re training models in English even if some/most of the researchers also speak a Chinese language. Plus, HuggingFace is English-native, and working on FOSS models (FOSLMs?) without targeting that community would be like making a command line accounting tool and not immediately posting it to the HackerNews community.
Your comment seems to imply some sort of hidden motivation, but idk, seems pretty straightforwardly benign to me! Plus it’s hard to say how many papers are published in other languages about LLMs, considering we wouldn’t read them.
I think open-weight is somewhere between open source and binary.
Reason: cannot be reproduced or practically modified without the source data collection.
... but they did not specify their pov?
AI is freaking awesome. Whatever project I choose to do now can be more complex or difficult than previously.
Titanic (1997) - Iceberg,Right ahead https://www.youtube.com/watch?v=bYOn3-PhA9c
It's all so incredibly manual, from the guys who literally stand at the front of the ship just looking out for icebergs, to the man running and pulling on a giant ringing lever of some kind to tell the engine room what to do, the guys in the engine room running around in a panic, yelling orders across the room, turning giant wheels, the levers being pulled to shut the engine down, the guys manually shovelling coal stopping and shutting the boilers... it's all so manual
Meanwhile today a majority of that is fully automated and/or a few clicks of a button
The movie would be a hell of a lot more boring today!
Titanic 2024 edition:
Some random person most likely: Iceberg right ahead!
Random staff member: We know... we're going around it
But your comment reminds me of this:
What will AI Programming look like in 5 Years? https://www.youtube.com/watch?v=eaedq1Jl2fc
It's the same thing I see with ICE engines vs EVs, and smartphones and laptops compared to desktops
All the manual work in the world is going away... but I guess if you're really itching for manual work farming and the trades don't seem to be going anywhere... yet
As to AI, I quite like it, I use:
https://www.nomic.ai/gpt4all
Which runs LLMs on my local PC, which is running on my solar and solar battery, so no CO2 emissions ^^
And as I have trouble learning, sometimes I have to search 5 to 10 websites explaining something before I 'get it'. With AI it helps me to poke it and go "explain this", almost like a second person is there just to try and help. Sometimes it helps, sometimes it doesn't.
Same with coding: sometimes it might just give me an idea or a thought that I wouldn't have had without it, one that gets me closer to where I want to go.
For posterity, I saved the OpenCoder:8b output here: https://pastebin.com/VRXYFpzr
irrefutable proof we have AGI. it's here. they are as sentient as any human in my code reviews
I'm interested !
However, it doesn't seem trivial to do deduplication in that case without removing relevant/necessary context.
those who don't remember history are bound to repeat it or something.
Qwen is really ahead of the pack right now when it comes to weight-available models.
I don't see why you would use it over Claude and 4o-mini with Cursor unless you are working on a top secret repo
If it works for you, devcontainers now work under Linux w/ docker.
Plenty of companies won't let you use those products with our internal code.
1. https://llm.extractum.io/model/Qwen%2FQwen2.5-7B,58qKLCI6ani...
[1] https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF
[2] https://github.com/ggerganov/llama.cpp
Is this slob?
"systems.while" is obviously not a valid domain.
I gave it a few tries but couldn't figure it out.