What does your actually useful local LLM stack look like?
I’m looking for something that provides you with real value — not just a sexy demo.
---
After a recent internet outage, I realized I need a local LLM setup as a backup — not just for experimentation and fun.
My daily (remote) LLM stack:
- Claude Max ($100/mo): My go-to for pair programming. Heavy user of both the Claude web and desktop clients.
- Windsurf Pro ($15/mo): Love the multi-line autocomplete and how it uses clipboard/context awareness.
- ChatGPT Plus ($20/mo): My rubber duck, editor, and ideation partner. I use it for everything except code.
Here’s what I’ve cobbled together for my local stack so far:
Tools
- Ollama: for running models locally
- Aider: Claude-code-style CLI interface
- VSCode w/ continue.dev extension: local chat & autocomplete
Models
- Chat: llama3.1:latest
- Autocomplete: Qwen2.5 Coder 1.5B
- Coding/Editing: deepseek-coder-v2:16b
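For reference, pulling those three models is a one-liner each (tags as they appear in the Ollama library; disk sizes will vary by quant):
$ ollama pull llama3.1
$ ollama pull qwen2.5-coder:1.5b
$ ollama pull deepseek-coder-v2:16b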
Things I’m not worried about:
- CPU/Memory (running on an M1 MacBook)
- Cost (within reason)
- Data privacy / being trained on (not trying to start a philosophical debate here)
I am worried about:
- Actual usefulness (i.e. “vibes”)
- Ease of use (tools that fit with my muscle memory)
- Correctness (not benchmarks)
- Latency & speed
Right now: I’ve got it working. I could make a slick demo. But it’s not actually useful yet.
---
Who I am
- CTO of a small startup (5 amazing engineers)
- 20 years of coding (since I was 13)
- Ex-big tech
Recently I had Gemma3-27B-it explain every Python script and library in a repo with the command:
$ find . -name '*.py' -print -exec sh -c '/home/ttk/bin/g3 "Explain this code in detail:\n\n$(cat "$1")"' _ {} \; | tee explain.txt
There were a few files it couldn't figure out without other files, so I ran a second pass with those, giving it the source files it needed to understand source files that used them. Overall, pretty easy, and highly clarifying.
My shell script for wrapping llama.cpp's llama-cli and Gemma3: http://ciar.org/h/g3
That script references this grammar file, which constrains llama.cpp to generate only ASCII: http://ciar.org/h/ascii.gbnf
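For anyone who hasn't seen this pattern: a minimal wrapper in the same spirit, using llama.cpp's llama-cli with that grammar file, would look something like the sketch below (the model path, quant, and context size are placeholders, not what g3 actually uses):
$ cat ~/bin/g3-sketch
#!/bin/sh
# Rough sketch only -- the real g3 script is at the URL above.
# Passes all arguments as the prompt and constrains output to ASCII via the grammar.
llama-cli -m /models/gemma-3-27b-it-Q4_K_M.gguf \
    --grammar-file /models/ascii.gbnf \
    -c 8192 \
    -p "$*" 2>/dev/null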
Cost: electricity
I've been meaning to check out Aider and GLM-4, but even if it's all it's cracked up to be, I expect to use it sparingly. Skills which aren't exercised are lost, and I'd like to keep my programming skills sharp.
I don't see the point of a local AI stack, outside of privacy or some ethical concerns (which a local stack doesn't solve anyway imo). I also *only* have 24GB of RAM on my laptop, which it sounds like isn't enough to run any of the best models. Am I missing something by not upgrading and running a high-performance LLM on my machine?
Not to mention, running a giant model locally for hours a day is sure to shorten the lifespan of the machine…
That is not a thing. Unless there's something wrong (badly managed thermals, an undersized PSU at the limit of its capacity, dusty unfiltered air clogging fans, aggressive overclocking), that's what your computer is built for.
Sure, over a couple of decades there's more electromigration than would otherwise have happened at idle temps. But that's pretty much it.
> I think I would need to spend $2000+ to run a decent model locally
Not really. Repurpose second hand parts and you can do it for 1/4 of that cost. It can also be a server and do other things when you aren't running models.
There's no reason running a model would shorten a machine's lifespan. PSUs, CPUs, motherboards, GPUs and RAM will all be long obsolete before they wear out even under full load. At worst you might have to swap thermal paste/pads a couple of years sooner. (A tube of paste is like, ten bucks.)
Once that is set up, you can treat your agents like (sleep-deprived) junior devs.
Claude Code has been an absolute beast when I tell it to study examples of existing APIs and create new ones, without bringing any generated code into context.
I don't understand people who pay hundreds of dollars a month for multiple tools. It feels like audiophiles paying $1000 for a platinum cable connector.
Cost: $0
Currently, I treat all generated code as "sample code", which is not of much use and a waste of time, but let's see what the future brings.
I use llama to generate "boilerplate" in simple non-coding sessions.
Although my writing skills aren't great, I find that starting with pre-written content makes it easier for me.
Model: Llama-3.2-3B-Instruct.Q6_K.llamafile
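For anyone who hasn't tried llamafiles: they're self-contained executables, so getting that one running is roughly the following (it serves a local web UI by default; exact flags vary by release):
$ chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile
$ ./Llama-3.2-3B-Instruct.Q6_K.llamafile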
- Ollama: for running LLMs
- OpenWebUI: for the chat experience https://docs.openwebui.com/
- ComfyUI: for Stable Diffusion
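If anyone wants to replicate the Open WebUI part, the Docker one-liner from their docs is roughly this (check the link above for the current flags, e.g. the extra ones needed to reach an Ollama running on the host):
$ docker run -d -p 3000:8080 -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main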
What I use:
Mostly ComfyUI, and occasionally the LLMs through OpenWebUI.
I have been meaning to try Aider. But mostly I use Claude, at great expense I might add.
Correctness is hit and miss.
Cost is much lower, and latency is better or at least on par with cloud models, at least for the serial use case.
Caveat: in my case, "local" means running on a server with GPUs on my LAN.
And is there an open source implementation of an agentic workflow (search tools and others) to use with local LLMs?
Also, none of this is worth the money, because it's simply not possible to run the same kinds of models you pay for online on a standard home system. Things like GPT-4o use more VRAM than you'll ever be able to scrounge up unless your budget is closer to $10,000-25,000+. Think multiple RTX A6000 cards or similar. So ultimately you're better off just paying for the online hosted services.
Of course, the economics are completely at odds with any real engineering: nobody wants you to use smaller local models, and nobody wants you to consider cost/efficiency savings.
Seems like there would be cost advantages and always-online advantages. And the risk of a desktop computer getting damaged/stolen is much lower than for laptops.
https://zed.dev/blog/fastest-ai-code-editor
SYSTEM """ You are a professional coder. You goal is to reply to user's questions in a consise and clear way. Your reply must include only code orcommands , so that the user could easily copy and paste them.
Follow these guidelines for python: 1) NEVER recommend using "pip install" directly, always recommend "python3 -m pip install" 2) The following are pypi modules: ruff, pylint, black, autopep8, etc. 3) If the error is module not found, recommend installing the module using "python3 -m pip install" command. 4) If activate is not available create an environment using "python3 -m venv .venv". """
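Assuming that SYSTEM block sits in an Ollama Modelfile (which is what the syntax looks like), wiring it up is roughly the following (the base model and the "quickcode" name are placeholders, not what I actually use):
$ cat Modelfile
FROM qwen2.5-coder:7b
SYSTEM """ ...the prompt above... """
$ ollama create quickcode -f Modelfile
$ ollama run quickcode "git: undo the last commit but keep the changes"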
I specifically use it for asking quick questions in the terminal that I can copy & paste straight away (for example, about git). For heavy lifting I am using ChatGPT Plus (my own) + GitHub Copilot (provided by my company) + Gemini (provided by my company as well).
Can someone explain how one can set up autocomplete via ollama? That's something I would be interested to try.
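With the Continue extension mentioned earlier in the thread, autocomplete against Ollama is mostly a config entry; the relevant fragment of ~/.continue/config.json would be along these lines (key names from Continue's config.json schema as I remember them; newer releases moved to config.yaml, so double-check their docs):
"tabAutocompleteModel": {
  "title": "Qwen2.5 Coder 1.5B",
  "provider": "ollama",
  "model": "qwen2.5-coder:1.5b"
}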
Just out of curiosity, what's the difference?
Seems like all the cool kids are using uv.
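For the curious, the uv equivalent of the venv + pip dance above is roughly this (subcommand names from uv's pip-compatible interface):
$ uv venv .venv
$ uv pip install ruff pylint black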
As an aside, I disagree with the `python3` part... the `python3` name is a crutch that it's long past time to discard; if in 2025 just typing `python` gives you a python 2.x executable your workstation needs some serious updating and/or clean-up, and the sooner you find that out, the better.
The difference is essentially to decouple the environment from the runtime.
pip install it is for me
Where I've found the most success with local models is with image generation, text-to-speech, and text-to-text translations.
Sometimes with Vim, sometimes with VSCode.
Often just with a terminal for testing the stuff being made.
I’ve been going through some of the Neovim plugins for local LLM support.
Depending on your hardware you could do something like:
aider --model "ollama_chat/deepseek-r1:14b" --editor-model "ollama_chat/qwen2.5-coder:14b"
[1] - https://aider.chat/docs/install/optional.html#enable-playwri...
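One gotcha: aider also needs to know where the local Ollama server lives, which (going by aider's Ollama docs) is just an environment variable pointing at the default port:
$ export OLLAMA_API_BASE=http://127.0.0.1:11434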
https://github.com/nomic-ai/gpt4all/releases