That's almost exactly my setup and I'm very happy with its performance.
I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.
Both fail at different tasks, and Qwen more so than Claude.
But the way Qwen fails is much more straightforward. In writing tasks, Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.
In coding tasks that Qwen can't solve, it often just goes into a tool-calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things, just making more and more mess that takes forever to clean up.
I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers have had a similar experience?
I keep finding more and more use cases for Q3.6 27b (same league), and the best performance is when the answer to my question is already in the context.
The moment I'm trying something open-ended or ambitious, Claude/ChatGPT clearly take you to the goal quicker.
For things where there's a way to build a knowledge base, though, the local LLM can definitely be a true contender. Plus, with a big context and no worries about filling it over and over, you can get quite far.
I'm writing this literally in between steps of cooking a pasta for which the local LLM ordered the products online. I've built a grocery shopping skill, so it roughly knows what I have in the fridge (loosely), my last 10 representative orders (general preferences plus rich info about shops and SKUs around me) and actual real-time in-stock info. The last part has been my personal pet peeve with every product that promised cooking ingredient delivery (that is not packaged specifically for that).
This is what has been promised to us by every big tech company with an agent, and now a local LLM has actually solved that for me fully.
I keep playing around with this exact concept. While I don’t always trust an entirely AI-generated recipe, more traditional setups are super rigid when it comes to ingredients.
I kept getting recipes with "that one ingredient", which was either a major PITA to source or produced too much waste, even from a real-world dietician consultation. For example: use a quarter of a pumpkin for something. Those were good recipes in terms of macronutrient composition, but they don't work long term due to logistics.
I'm years past needing that strict diet, but the itch to fix or ease some parts of the process stayed.
I know the big labs like to pretend that their models are trillion parameter. But how likely is that really to be the case when Qwen 3.6 35B A3B gets so close to their performance? Seems that with the best research applied, best training data, they'd be able to top the charts with a 60B model quite easily.
Not having a lot of experience with this, I ask a naive question: is there a world where you can take your local LLM and hook it up to Claude and get more Claude-like results from your local model? Obviously, there are going to be material differences in how these perform, but are we getting close to a place where this is viable? I imagine that the answers are a combination of “not yet” and “yes but it’s a lot slower” and “yes but there is actually little point to doing this because ‘what Claude gets you’ is highly baked into anthropic’s models and that’s part of what you’re paying for.”
You can use ollama as the backend for claude code!
ollama launch claude --model
I would characterize it as doable, but not really viable. It's "yes you can do it but it's a lot slower", with a hint of "and the best local LLMs are on par with Haiku or maybe Sonnet, so larger and longer tasks get notably worse".
I have a "task router" that is a small local LLM on my mac mini (Qwen 3.5 0.8B) that I use to decide (when activated) with Pi whether to route a given task to my local LLM (Step 3.7 Flash) or to <given cloud provider>, if that counts? It works surprisingly well really. Though some of the cloud providers are getting so good and so cheap (GLM 5.1/5.2, MiniMax M3, among others) that the need to use my local one becomes less and less relevant, depressingly!
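For anyone curious what such a router boils down to, here is a minimal sketch. It is not the poster's implementation: `make_router`, `toy_classifier` and the two lambda backends are hypothetical names, and the word-count heuristic merely stands in for the call to the small model.

```python
from typing import Callable, Dict

Backend = Callable[[str], str]


def make_router(classify: Callable[[str], str],
                backends: Dict[str, Backend],
                fallback: str) -> Backend:
    """Build a function that sends each task to the backend the classifier names."""
    if fallback not in backends:
        raise ValueError(f"unknown fallback backend: {fallback!r}")

    def route(task: str) -> str:
        label = classify(task).strip().lower()
        # A tiny model can emit junk labels, so anything unrecognized goes to the fallback.
        backend = backends.get(label, backends[fallback])
        return backend(task)

    return route


# Stand-ins for illustration only: a real setup would ask the small model for
# the label and forward the task to an actual inference endpoint.
def toy_classifier(task: str) -> str:
    return "cloud" if len(task.split()) > 8 else "local"


route = make_router(
    toy_classifier,
    {"local": lambda t: f"[local] {t}", "cloud": lambda t: f"[cloud] {t}"},
    fallback="cloud",
)
print(route("rename this variable"))
```

The only design point worth noting is the fallback: a sub-1B classifier will sometimes answer with something that is not a valid label, so the router should never crash on it.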
You're kinda talking about Claude being used for planning/architect role, while local LLM is just executing it (performing edits) -- at least in such form it's a thing, yes.
I have said this before as well: these top-of-the-line models write clever, convoluted code. The code looks intelligent from above, but is a maintenance headache. It makes the entire thing fragile for future development on top of it.
The smaller models, especially the aforementioned ones, fail much more, but they do not write that insane kind of code. They do simple, non-clever coding like humans do. Much easier to maintain and build upon.
Qwen-3.6-27b is a wonderful model. Exceptionally good for its size, and excellent in general as well. And with mtp available now, it can run at 60+ tps on a single 3090... this is roughly 30% faster token generation speed than most of the hosted ones being served from giant data-centers.
It's also going to fail consistently. When calling Claude, you don't know what version of the model you are talking to; it might be quantized due to load or have been patched.
This is true. The failure modes are simpler. And yes, the ceiling is lower as well. Smaller models' stability is lower over long sequences, and thus anything that needs a lot of CoT will be weaker. For example, I had a dumb lock + condvar with multiple defenses against lost wakeups in an N-producer, 1-consumer queue thing. Models generally need a lot of CoT before they realise they can switch it to a semaphore instead. Qwen typically isn't stable over such long CoTs and ends up adding more and more slop and band-aids, versus a larger model that outputs a large CoT and then realises it can swap 3 functions out for 2 lines if we use a semaphore.
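To make the semaphore point concrete, here is a minimal Python sketch of the idea (the original code isn't shown and was presumably not Python; `SemaphoreQueue` and `run_demo` are made-up names). A counting semaphore remembers every `release`, so a wakeup issued before the consumer starts waiting cannot be lost, which is exactly what the extra condvar defenses were guarding against.

```python
import threading
from collections import deque


class SemaphoreQueue:
    """Unbounded queue for many producers and one consumer."""

    def __init__(self):
        self._items = deque()  # append/popleft on a deque are thread-safe
        self._available = threading.Semaphore(0)  # number of items ready to take

    def put(self, item):
        self._items.append(item)
        # The count persists, so a release issued before the consumer
        # starts waiting is not lost.
        self._available.release()

    def get(self):
        self._available.acquire()  # blocks while the count is zero
        return self._items.popleft()


def run_demo(n_producers, items_each):
    """Push items from several threads and drain them from the calling thread."""
    q = SemaphoreQueue()

    def produce(pid):
        for i in range(items_each):
            q.put((pid, i))

    threads = [threading.Thread(target=produce, args=(p,)) for p in range(n_producers)]
    for t in threads:
        t.start()
    received = [q.get() for _ in range(n_producers * items_each)]
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(len(run_demo(4, 100)))
```

Each `acquire` can only succeed after a matching `append`, so `popleft` never sees an empty deque and no explicit lock or predicate loop is needed.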
The recommended values for Qwen 3.6 in thinking mode are `--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00`, and `--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00` for coding/tool calling tasks, and for non-thinking, `--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00`.
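If you switch between these modes often, it may help to keep the values in one place. A small sketch: the numbers are the ones quoted above, while the preset names and the `sampler_args` helper are my own invention, not anything llama.cpp or Qwen ships.

```python
# Sampler values as quoted above; the preset names are hypothetical labels.
PRESETS = {
    "thinking": {"temp": 1.0, "top-p": 0.95, "top-k": 20, "min-p": 0.0},
    "thinking-coding": {"temp": 0.6, "top-p": 0.95, "top-k": 20, "min-p": 0.0},
    "non-thinking": {
        "temp": 0.7, "top-p": 0.8, "top-k": 20,
        "presence-penalty": 1.5, "min-p": 0.0,
    },
}


def sampler_args(preset):
    """Turn a preset into a flat list of command-line arguments."""
    args = []
    for name, value in PRESETS[preset].items():
        args += [f"--{name}", str(value)]
    return args


if __name__ == "__main__":
    print(" ".join(sampler_args("thinking-coding")))
```

The resulting list can be appended to whatever command line you already use to start the server.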
The options listed are none of these.
Also, the recommended Qwen MTP settings are `--spec-type draft-mtp --spec-draft-n-max 2`. 3 is not good on Nvidia hardware under different workloads. You can also add `ngram-mod`, but after `draft-mtp`; however, default `ngram-mod` settings aren't well tuned, and you want `--spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 16 --spec-ngram-mod-n-match 6` (defaults are 48, 64, 24; the ratio is good, the magnitude is suboptimal).
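A quick check of the "ratio is good, magnitude is suboptimal" remark, using only the two triples quoted above: the suggested values are the defaults divided by four, so the proportions between n-min, n-max and n-match are unchanged.

```python
from fractions import Fraction

# (n-min, n-max, n-match), both triples as quoted in the comment above.
defaults = (48, 64, 24)
suggested = (12, 16, 6)


def proportions(triple):
    """Express each value relative to the first one."""
    return tuple(Fraction(v, triple[0]) for v in triple)


# One distinct quotient means every value was scaled by the same factor.
scale = {Fraction(d, s) for d, s in zip(defaults, suggested)}

if __name__ == "__main__":
    print(proportions(defaults) == proportions(suggested), scale)
```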
80 tps with a 5080 + 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tps utilizing all three for qwen3.6 27b q8. Guess I've got more optimization to do.
Would like to see the perf of their setup with and without mtp and ngram speculative decoding though, as well as parallel decode performance (once llamacpp mtp plays well with multiple slots).
Being in California, electricity alone makes this non-competitive with just paying for cloud, though.
That’s the cost of using a new hardware provider. A single RTX Pro 6000 Blackwell Max-Q will do better than that and be much more usable. I have 2 running DS4 Flash at 160 tok/s with max num seqs 4.
Very interesting though, these Tenstorrent chips. Might get one to experiment with.
Do you get anything useful out of your 4090 (I have one too)? Local cloud sounds like a fun idea but I just don’t see how it competes against OpenAI/Anthropic.
I think it’s not really worth it compared to just buying tokens or a coding plan.
My setup has the 4090 handling attention while the TT accelerators handle MLP. With just a 4090 you can have the CPU handle the MLP layers and use a MoE model, assuming a sufficiently powerful CPU. I tried that setup with minimax 2.5 before, and was able to eke out around 10 to 15 tps (albeit with a 7965wx cpu).
I get 28tps for Qwen3.6 27B on a Ryzen AI Max 395+, with enough spare memory to run another two small models on the side. 60tps for 35B. Am surprised this is not more common.
The software stack is pretty immature, definitely very DIY. Their officially supported models are pretty old at this point, though there’s community support for gemma4, and support for models with GDN like qwen3.6 is supposedly very close.
The entire stack (minus some binary blobs in firmware) is open source, so if you have the time and persistence you can get whatever you want done.
A few community members have been working on support with llamacpp, where we can have supported operations offloaded to the TT cards, while having unsupported ops running on GPU or CPU. Llamacpp is pretty good at that. The existing kernels could definitely be better, and I’ll try my hand at writing some kernels some time.
I just bought a $25 Chinese 2x Oculink card and two Minisforum DEG1, had some spare PSUs lying around, and just installed two cards on each.
It works.
I saw that there is also a 4x Oculink card, but I don't know if that will work, too.
Sits in silence, watching China as they innovated a new type of ultra-thin gpu board and calling it 5090 "Turbos." Still waiting for Shenzhen listings to post a 5090 official verified with VBIOS crack...
On Apple Silicon, with MLX-LM, I am getting 20 tok/s with Macbook Max M5.
Not sure how it compares to llama.cpp performance.
In any case, while it is noticeably slower than this Nvidia RTX setup, being able to run such models on laptop is wild. Though, it heats my laptop rapidly.
I bought two 3080/20gb cards and one of those MACHINIST X99 mainboards as well (one with two full x16 PCIe slots). Those boards come with a Xeon CPU included (for the PCIe lane support). It set me back 800 euros total (I had a spare PSU, SSD and memory in a drawer), and now I'm also happily running Qwen 3.6 Q8 (MTP) at 80 tk/s.
CNS means "chipset not supported", and I doubt that is the case. Are you sure you are using the patched nvidia module? Run `modinfo nvidia` to check which one is loaded.
I'm using bazzite on my ai-rig just because it has the gpu-optimized things setup (also nvidia-open).
Looking at P2P seems to be available only for 90-versions of the nvidia rtx gpu line, not 80, and some versions of 50xx? (apparently the 5080?).
Anyway, I downloaded that uncensored model and tweaked those KV settings etc. I'm still getting 60-80 tk/s, but I'm able to get my context to 180224 now; it used to be 131072, which gave me some trouble, so this is already a win :)
I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc instead of what’s essentially just a recipe.
Agreed. To put this in perspective, batch 1 token decode is bandwidth limited in theory.
Memory bandwidth of RTX 3090 is listed as 936GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the 24GB of that GPU, 30tok/s means the achieved bandwidth is only 720GB/s. There's a bunch of room for improvement here even without MTP, and those improvements should largely stack with MTP.
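The arithmetic behind that estimate, as a small sketch. It assumes a dense model whose weights are all read once per generated token and ignores KV-cache traffic; the 24 GB figure is the comment's own worst-case guess, not a measured model size, and the function names are mine.

```python
def achieved_bandwidth_gb_s(weights_gb, tokens_per_s):
    """Bytes streamed per second if every weight is read once per token."""
    return weights_gb * tokens_per_s


def bandwidth_bound_tok_s(peak_gb_s, weights_gb):
    """Upper bound on batch-1 decode speed for a purely bandwidth-limited model."""
    return peak_gb_s / weights_gb


PEAK_GB_S = 936    # listed RTX 3090 memory bandwidth
WEIGHTS_GB = 24    # assumption: the model exactly fills the card
MEASURED_TOK_S = 30

achieved = achieved_bandwidth_gb_s(WEIGHTS_GB, MEASURED_TOK_S)
bound = bandwidth_bound_tok_s(PEAK_GB_S, WEIGHTS_GB)
utilization = achieved / PEAK_GB_S

if __name__ == "__main__":
    print(f"{achieved} GB/s achieved, bound {bound} tok/s, {utilization:.0%} of peak")
```

Under these assumptions the ceiling is 39 tok/s, so 30 tok/s is about 77% of peak; a smaller model at the same speed would mean a lower utilization still.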
I've been using https://spark-arena.com/leaderboard to glean this kind of information for DGX Spark, a sort of recipe book. The Nvidia forum has people talking about the things you wish to know. I see some on Discord/Reddit/et al, but less cohesive
I've switched from using the spark as a way to run one model as best it can to running several support models for the md kb I'm working on
Would you mind giving these a try and letting me know how they work for you? I’d imagine you would get better results and the latter will fit on a single GPU.
I can understand the joy of running things yourself, and can also see the privacy aspect. However, I pay ~$3 per 1M tokens for that model on Openrouter, and it's not even quantized. A refurbished 3090 and a 5080 will set you back well over 2k, not to mention the electricity to run them...
> I pay ~$3 per 1M tokens for that model on Openrouter
I think the thing is, there's an unspoken "for now" at the end of that sentence and people running this locally are hedging against that "for now". Some people prefer to feel that they own the means rather than rent the means, even if the one they own is worse than the one they can rent. Especially with today's Fable news and the harsh realisation that the "for now" is dependent on very many unpredictable factors, where the one you have locally costs you capital today and a relatively predictable run-rate (made more predictable with on-prem solar for example), but should otherwise work predictably forever.
I'm not saying that you're wrong to do what you're doing, just that many people have their own lines in the sand where renting vs buying makes sense, and it doesn't only boil down to a rational (or irrational) financial decision.
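For those who do want to run the purely financial side of rent vs. buy, a rough break-even sketch. The $3 per 1M tokens and the ~2k hardware figure come from the comments above, and 700 W and 80 tok/s are numbers reported elsewhere in this thread; the 0.30 per kWh price is purely my assumption. It ignores idle draw, prompt processing and resale value, and mixes currencies loosely.

```python
def electricity_cost_per_mtok(power_w, tok_per_s, price_per_kwh):
    """Energy cost of generating one million tokens at a steady rate."""
    seconds = 1_000_000 / tok_per_s
    kwh = power_w / 1000 * seconds / 3600
    return kwh * price_per_kwh


def break_even_mtok(hardware_cost, cloud_price_per_mtok, local_cost_per_mtok):
    """Millions of tokens after which owning the hardware becomes cheaper."""
    saving = cloud_price_per_mtok - local_cost_per_mtok
    if saving <= 0:
        return float("inf")  # local never catches up at these prices
    return hardware_cost / saving


local = electricity_cost_per_mtok(power_w=700, tok_per_s=80, price_per_kwh=0.30)
mtok = break_even_mtok(hardware_cost=2000, cloud_price_per_mtok=3.0,
                       local_cost_per_mtok=local)

if __name__ == "__main__":
    print(f"electricity: {local:.2f} per 1M tokens, break-even: {mtok:.0f}M tokens")
```

With these inputs electricity comes to roughly 0.73 per million tokens and the break-even sits near 880 million tokens, which is why the non-financial reasons above tend to carry the argument.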
You're treating open weight inference providers the same as proprietary ones. They're fundamentally different business models. Proprietary companies have an incentive to subsidize actual inference and training costs in order to gain market share. The few dozen or so companies selling Qwen models by the token on openrouter are in a commodities market.
If suddenly the CCP declared a total digital embargo on Alibaba's Qwen models or even if for some reason all of mainland China (and Singapore) was completely unreachable from the rest of the world, the dozen or so companies selling Qwen by the token elsewhere in the world could continue business as usual.
I don’t know anything about the open weight host business model. Do we know for certain that the folks selling inference by the token are really selling them in an upfront and profitable way? No subsidies from harvesting the info, to sell to the model trainers or anything like that?
Or subsidies from hopeful investors sweet-talked into not understanding the commodity nature of the business they are investing in. But that does not change much about the general assessment.
Chances are the typical story goes founders start fully believing that they would succeed with their own innovation but slip down a gradient towards commodity provider without really noticing themselves.
I was thinking of user-side regulations as well, not only provider-side ones. I could imagine a world where a government rules that you may not use LLMs for anything, which would be much easier to get around if you have local means.
I've spent the past week trying to scheme a way to get affordable local inference of something useful (Qwen3.6-35B-A3B) for ~$500 and have come to the conclusion that it simply isn't viable. A pair of power-restricted P100s in a workstation gets close, but the workstations themselves are expensive and rare as hen's teeth (not to mention loud and large). I think early '27 will be when things open up as the hardware market unclenches and further strides are made in small capable models.
I'm running Qwen3.6-35B-A3B on a very ordinary desktop PC (32GB DDR5, 8GB Radeon 6600XT) and getting a useful 15-20 tok/sec out of it. The MoE architecture and auto offloading from system to VRAM is just fantastic. Unsloth Q4_K_XL.
The Qwen3.6-27B is unbearably slow as it doesn't fit in VRAM, though; I think the MoE is very easy to run.
It is also extremely nice that you can just `apt install llama.cpp libggml0-backend-vulkan` now too.
I wonder what the parent poster means by "useful" and what he actually tried. It feels like he was just comparing some benchmarks.
Yesterday I downloaded Gemma4-26B with Ollama on a quite rusty desktop with a 1070 8GB, 32GB of RAM and a Core i5-9400.
I dropped in a photo of my water meter and told it to read the value and serial number. It was far from instant, but it was also easily under 3 minutes and the result was correct.
Earlier, around February, I tried the same photo with Gemma3 on the same hardware and the results were bad.
An R9700 is $1350 and can get 100 TPS running Qwen3.6-35B-A3B Q5 with 130k context window (with room to spare) with a bit of fine tuning llamacpp-vulkan, but llamacpp's repository instability and lack of real versioning frustrates me.
In terms of electricity, if you aren't using it, even with all the VRAM loaded, at most you're wasting about 30 watts or so.
Prompt processing a large uncached context is annoying, which is why I forced a lower context window, but I don't know if it's any worse in performance than the cloud models I've used.
There's a niceness, to me, knowing I don't have to rent it anymore. If you rent it, the terms can change regularly.
I use local models to explore, hosted models to refine. I somewhat envy those who can sustain local models (q8 120b+) running as a hobby.... for me, the practical path is a better SearXNG setup and knowing my routes forward.
I think it's important to be able to do both so you can stay in control of the price to value created relationship.
Last year, some people were publishing about aider/ollama/openrouter [1], and now thankfully people are publishing all over about pi/qwen/llama.cpp/openrouter. It's widespread.
It’s a personal hobby project. Why should we care how someone chooses to spend their free time and money? Lots of hobbies are expensive and pointless if you compare them with commercially available offerings. That’s why it’s a hobby and not a small business.
Any sane crypto miner undervolted and underclocked their GPUs for efficiency's sake; if anything, they went through less wear than, say, regular gaming.
Openrouter doesn't give you access to the model's internals, i.e. complete control of logprobs, sampler stack, any PEFTs.
Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI and accept that the cost you'll pay per token is higher than it would be via any cloud. That's the price for privacy, control, and better quality via inference-time optimizations that otherwise aren't available.
> Openrouter doesn't give you access to the model's internals, i.e. complete control of logprobs, sampler stack, any PEFTs.
Openrouter gives you access to whatever the inference provider gives. They're just the middleman. Many providers give logprobs if you ask, it's in their API. And yeah, no Peft or Lora, but that's an entirely different product. And some of the inference providers do that directly.
> Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI
But the whole point of openrouter is that you can run models by the token and you don't have to care about local AI? Sounds like you're more upset that people aren't making the same calculation on privacy and local control vs cost and ease of use.
It is absolutely mind blowing to see some of the responses here. Open source, run-your-own, pay for nothing, we’re-all-nerds-that-buy-the-hardware-anyways ethos seems basically dead.
I guess I’m getting old. I own two 16gb cards and I use them for models, for GPU passthrough for gaming, 3d model rendering, etc. 14 year old me is mortified at this community.
Times are changing. The open-weight models have needed time to catch up, but they're finally at a point now where we can get almost frontier level capabilities for coding.
I just wish we had a way to actually benchmark them properly though. Still seems no one has solved the problem of software architecture, brittleness and bloat as the codebase grows. Models love to add stuff, but they rarely clean up as they go. In a perfect world they'd do both near equally as they're developing.
It would be nice if there was an "architecture quality" benchmark that distilled the essence of what it means to have a good architecture, but I suppose that's an open research question with a lot of variables? Like how is good architecture actually quantified and measured? Is there a mechanism that can be re-used across all codebases to clearly denote one that is good and one that is bad, or is it highly subjective and dependent on the lens you're looking at it through? Is there a lot more to it than just "how much refactoring effort is required to extend this in the future?".
Surely this is something that has been well researched - yet I never really hear anything about it. Makes me wonder why.
I tried implementing qwen through openrouter and deepinfra. Even without thinking, I had to wait 60s+ for the full result, where haiku or flash would be done in 5 or 6 seconds.
It does come with one tiny little issue: it now draws 700W on full load. Just a single 5080 is enough to measurably heat up a room when loaded (320W draw at the wall on mine), and with that amount of power flowing through, you'd better have a good PSU and check your power plugs themselves; these are going to get HOT when your entire setup is basically drawing 1kW.
I am actually surprised with the power draw, the box itself idles at 20W, which already amazes me for a Ryzen; when computing, I barely pass the 600W bar, and as I am not really using it to vibecode an entire system, I don't even notice the spikes on the power monitor (Shelly + homeassistant).
2xRTX5080 would be awesome. You'd only be able to run a q6, which is already pretty good, but moreover you'd be able to use P2P and run Blackwell at full speed, which I can't.
Do you mean by commanding a browser, or by using APIs?
Because if they don't imply that size is needed for every task, they'll end up tanking their valuations.
https://blog.nilesh.io/post/ai-profit-race
https://pi.dev/
Of abliterated Qwen 3.6 27B models, huihui's ends up being the worst. Try heretic instead. https://huggingface.co/mradermacher/Qwen3.6-27B-uncensored-h...
The main issue is the immature software, and somewhat baroque way of writing kernels. Please, buy one and join us.
It's surprising how little these things come up given the price they go for
NVIDIA GeForce RTX 5080: https://flopper.io/gpu/nvidia-geforce-rtx-5080-16gb
NVIDIA GeForce RTX 3090: https://flopper.io/gpu/nvidia-geforce-rtx-3090-24gb
      GPU0  GPU1
GPU0  X     CNS
GPU1  CNS   X
I guess not. I use llama.cpp with:
--spec-draft-n-max 3 --spec-type draft-mtp --split-mode tensor --tensor-split 1,1
and my (gen) tk/s are between 60 and 80.
I will test this uncensored model, with ngram added as well, this weekend.
Btw, I also set my power limit to 220 W per card (with nvidia-smi). That will cost you around 1 tk/s but save you a LOT of power and heat :)
https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-EX...
https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-Mi...
Do be sure to use dflash and/or mtp for the draft:
https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3
https://huggingface.co/turboderp/Qwen3.6-27B-DFlash-exl3
90 t/s for 27B Q8 256k context
260 t/s for 35B-A3B Q8 256k context
How would that change (improve) if you had two R9700s in a similar configuration?
[1] https://alexhans.github.io/posts/aider-with-open-router.html
And noise.
Occam’s razor rings true here: where’s the money in it?
Same here. There has to be someplace like this that's managed to cultivate a better crowd, but I'll be darned if I can find it.
If you're not power limiting in nvidia-smi, start.