Rendered at 07:07:52 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
Greenpants 12 hours ago [-]
I have! I care about data privacy and LLMs being free. I'm using the Pi coding harness but containerized and sandboxed, to make sure it's running completely offline. On my Mac Studio with 128GB RAM (or MacBook with 36GB RAM) I'm using Qwen3.6 35b, with only 3b active parameters so that it runs really fast. I've done a complete redesign for my website's homepage and blog with Django + Wagtail. The latter is interesting, because Wagtail is a bit less well-known, so the agent, without giving it internet access, doesn't always know how to develop for Wagtail. I've used Qwen3.5 122b for when things get more complex. At 10b active parameters, it's significantly slower though.
I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.
It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).
Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)
lambda 12 hours ago [-]
This is very similar to my setup. Pi in a container (I do let it have network access, just no access to creds or anything, only the one directory that I'm working on at the time and my ~/.pi directory), talking to llama.cpp in another container. I'm on a Strix Halo 128 GiB unified memory laptop.
I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.
And I'm still a AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.
But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.
For other chat tasks and translation, I'll frequently use Gemma 4 31B.
For audio, I'll use Gemma 4 12B.
I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.
chakspak 12 hours ago [-]
Hopefully this isn't off-topic, but your setup sounds just like mine, Strix Halo and (I'm assuming) llama.cpp on ROCm, and I'm finding that the Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?
havfo 8 minutes ago [-]
I was able to solve this for my setup, 7900XTX and llama.cpp on ROCM in the oh-my-pi fork of pi.dev harness. I documented my setup on github, check under my username/omp-config, but the important thing is making sure the context is strictly append-only, and starting llama.cpp with
I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it's not a huge difference, but I've been mostly saying on Vulkan.
The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.
But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinkging, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.
Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, becasue you're not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time.
In my models.ini, I have this for the Qwen3.6 models:
There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.
thefroh 2 hours ago [-]
I'm a little surprised that preserve_thinking would matter here for cache purposes. for actual capabilities/intelligence, yes, I'd imagine it helps to have past reasoning traces in multi-turn setups.
but for caching, all you are doing is leaving off a fraction of the most recent assistant message generation, which will have little/no impact on cache hit rate.
stymaar 42 minutes ago [-]
> all you are doing is leaving off a fraction of the most recent assistant message generation
True, but not a tiny fraction, qwen is very verbose in its thinking traces. And it basically means that for every (nonthinking) generated token you have to compute the KV twice (once as tg, the second one as pp).
ndom91 11 hours ago [-]
+1 using llama.cpp Vulkan releases with the Qwen models - runs much better than the ROCm releases.
I'll have to give the preserve_thinking a shot.
jderekw 8 hours ago [-]
Thanks for sharing have been running ROCm primarily with Qwen 3.6 and Qwen Coder, on the runs much better statement is that a stability, performance or other capability your experiencing?
dnautics 10 hours ago [-]
> Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?
Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?
lambda 9 hours ago [-]
So, one of the ways that this problem manifests is that most local models aren't trained on preserving the full reasoning between turns. Every turn, they skip passing the reasoning trace from previous turns to the the LLM. So if on one turn you have a long interleaved chain of reasoning and tool calls, then it responds to you, and then you give a new prompt to fix something, it has to re-process all of those tools calls now with the reasoning stripped out.
Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.
Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (0(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.
But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.
So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot. However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.
There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.
Some of these issue have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.
carterschonwald 7 hours ago [-]
thats a harness issue not a model issue. eg i have my own reasoninf harness that forced persisted cot
thefossguy69 57 minutes ago [-]
Would you mind sharing your harness for reasoning?
dnautics 8 hours ago [-]
wait do sota models use mamba-like SSMs? this is the first im hearing this
nl 7 hours ago [-]
Qwen 3.5 and above use Gated DeltaNet which alternate attention and SSM layers:
There is a bug in llama-cpp for qwen/gemma models, use vLLM instead
pdyc 2 hours ago [-]
what bug and it affects what?
LoganDark 11 hours ago [-]
What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.
I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)
mahadevank 3 hours ago [-]
Thanks a lot for your comment. I was using Qwen3 but asn't aware ofo the A3B Mixture-of-experts model. Works much better, thanks
fjdjshsh 6 hours ago [-]
>I'm still a AI skeptic
What does this mean in June 2026 wrt coding?
To me it sounds like being a "rice cooker skeptic". Some people don't like using rice cookers, some do.
femto113 4 hours ago [-]
For me the distinction is that your rice only needs to be edible once, while your code may need to last for decades. Using AI to code anything I could comfortably throw away if needed is a lot less fraught than letting it make choices that I and anybody who inherits the code is gonna have to live with, especially if by outsourcing those choices I reduce my understanding of the implications of those choices.
luipugs 38 minutes ago [-]
Don't you read through all the output of the agent before committing them?
secult 15 minutes ago [-]
That's not the way how human brain works.
HWR_14 4 hours ago [-]
I assume it means they are not sure it gives them a speed up. Which, since I don't know what they are trying to do, may be reasonable.
adyavanapalli 12 hours ago [-]
For the edit tool, you should consider implementing a hash-based approach where each line of code is hashed and referenced by it when doing replacements. You can read up on the approach here: https://blog.can.ac/2026/02/12/the-harness-problem/
I didn't do much benchmarking, but anecdotally, I found it to be making less edit errors. YMMV
pieterk 8 hours ago [-]
Yup, I used this for a while and IME it may get you a few percentages more of useful context initially, so quality feels a bit higher, but things start breaking down in funnier ways when you do run out of that quality for any reason later, so definitely caveat emptor.
ojr 8 hours ago [-]
I can use Gemini 3 Flash with the harness I built for around 8 years and still not exceed the cost of a Mac Studio with 128GB, the price for privacy is very high. Agentic flows that get stuck can be worked around but I prefer developer velocity.
danans 1 hours ago [-]
> I can use Gemini 3 Flash with the harness I built for around 8 years and still not exceed the cost of a Mac Studio with 128GB
And sounds like you haven't factored in the cost of electricity to run that Mac Studio as an LLM machine. Probably get a few more years.
disqard 7 hours ago [-]
Under-rated take, thanks for stating this!
Not everyone can plough $$$$ into hardware right now (more power to those who can), so choosing to rent is an A-Ok strategy.
tpm 3 hours ago [-]
It's ok if you can send your code and data to the provider. Some of us can't.
_zoltan_ 56 minutes ago [-]
We're discussing home use.
You can. You just don't want to. Huge difference.
kristopolous 46 minutes ago [-]
I've got a tool that sits in between the harness and inference engine called petsitter. It is a middleman validator to avoid just these kinds of issues. You can stack the fixes as needed (they're called tricks in the petsitter parlance)
> It gets into loops quite often, and surprisingly often gets the edit tool call wrong
I find that running better quantization, like Q8 tend to prevent this even though its a bit slower to run, it saves overall time with less churn
Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off
girvo 8 hours ago [-]
Right. Tokens/s decode isn't the most important thing to me: wall clock time for task completion is. And tracking all of that, on my GB10-based Asus box, Step 3.7 Flash at IQ4_XS beats Qwen 3.6 27B despite the latter having MTP, on all of my actual coding task evaluations in real codebases.
Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but thats literally not what I use these things for!
One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.
There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!
geophile 7 hours ago [-]
My experience is almost identical. I have found that I need to be very careful with planning, breaking things down into small isolated steps (I can have qwen do this); and also (me) writing a very clear design. Relying on qwen to fill in a lot of those precise details results in those about-to-write loops.
Yeah, that edit inability is weird. I’ve updated AGENTS.md to limit editing (as opposed to rewriting) and that helps a little.
nicman23 1 hours ago [-]
about the edit tool it is almost always trailing white spaces. if you give it a skill with a sed 's/( )*$//g' or something like that it speeds up things
gwerbin 6 hours ago [-]
I've noticed the same about the edit tool, in both Gemma and Qwen. Maybe I'm not running them with the right sampler settings, but I'm happy to hear I'm not the only one. Lots of mismatched whitespace and stuff, the model ends up doing hex dumps and maybe 5 or 6 attempts at editing a 5-line function into a 250-line Python file.
All of these models also seem to get stuck in long thinking loops, sometimes tripling the tokens of a frontier closed model which is really painful when inference is already on the slow side (on my Macbook).
p0w3n3d 14 minutes ago [-]
which coding agent are you using?
ltononro 11 hours ago [-]
What kind of coding do you do?
Do you keep track of frontier models to vibe check the differences and re-evaluate constantly or are you ok with having a nerfed model forever?
(not being judmental, just really wanto to know your framework here)
Greenpants 11 hours ago [-]
Some of the work I do, I do for an (EU) organisation that doesn't have clear rules or guidelines on the use of AI yet. Though I have seen colleague-developers blatantly putting source code into external Claude-like models, I stay true to my principles and don't. I know for certain that everything that I run through my local, offline Pi Container Sandbox cannot leave the machine, and thus can't result in a data breach. I do this for the peace of mind.
I do (unscientifically) experiment whenever a new capable local LLM (<=130b) releases with a license that permits commercial use. As for knowing my models require more work than Opus, I don't mind still having to puzzle on getting the architecture right. In any case, it forces me to stay in the loop of what's being built, which is a good thing.
kordlessagain 5 hours ago [-]
I'm adding Pi to Nemesis8 right now because I saw your comment, so thank you!
Could you give more details on how to make such a set up?
I'm not familiar with Pi, and not sure which kind of container you are referring to. Something mainstream like docker, or more classic like a BSD jail?
I started to experiment with locale LLMs, through ollama and Lemonade. Enough to throw simple prompts with code excerpts and get small scope code refactors. Though I still struggled to make them work with external tools, like my IDE, so they can be leveraged on to an agentic level with access to a full repository.
That's mainly for work, as they push for using LLMs, though with the new copilote license they provide it doesn't take me even a week to burn the whole token credit.
The tool can be useful, but in my experience without heavy guard rails and loops over tests. I suspect late models to also burn many token into rabbit hole of nonsense hypothesis, instead of doing straight forward correct implemention as you would expect from any entity with such a huge cumulated resources eaten and experimental playground to leverage on. Maybe incentives don't help model provider to minimize sold token, maybe it's just so hard to tame the beast all these bright minds with virtually infinite resources are not good enough.
Anyway, sorry for digression, but I would be extremely interested with a step by step tutorial to make a local LLM work in agentic level, including which kind of hardware is required to make it work properly.
pieterk 8 hours ago [-]
Yup, it's fantastically useful.
Maybe even more useful than Opus when I have all the constraints to an issue. There is less "knowledge" in the model (I get by with 48GB of RAM allocated to an 8b quant), so it has fewer things to hallucinate about.
I've been getting to know its limits pretty well over the last few weeks and would say it's an excellent code search/replacement/generation* engine.
It's got the "in-context script generation" flow down as well, so it will easily help automate tasks that you describe with text and perhaps example commands, or tools, or skills* that you provide.
*Think of it + Pi as an NLP abstraction layer over grep, or a shell, rather than a jack of all trades + world knowledge all-in-one.
westoque 6 hours ago [-]
> Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture.
that's why i use the frontier models because its a senior co-worker vs a junior. if you use the junior for the sake of privacy i think you're missing out on the best insights for a specific task.
physix 6 hours ago [-]
The dilemma I am facing is cost.
Consumer-grade subscriptions of the frontier models give you superb capabilities per dollar, them being heavily subsidized. But if you're working in an enterprise setting, that won't work. You need to upgrade, and that gets significantly more expensive.
Furthermore, basing the SDLC on leveraging the bargain subscriptions risks falling apart in the future, both from a cost perspective as well as the question of availability (e.g. Mythos).
So from a strategic perspective, going local on the LLM and still achieving great results with the right approach is very relevant.
willisrocks 1 hours ago [-]
Or you can get the best of both worlds--use frontier models to build a spec/plan, and use cheap models (open source or not) for implementation. Your max or team plan can go a lot further this way without giving up much for quality. Play with something like Superpowers to make this really approachable.
bxk76 5 hours ago [-]
Best insights can be over rated due to bandwith limitation of the brain. Even if Einstein is sitting next to you the whole day and helping out Theory of Bounded Rationality applies.
0xbadcafebee 12 hours ago [-]
The harness and the LLM parameters are pretty essential to getting better results and reducing loops. Tweak the parameters and you can mostly eliminate loops without negatively affecting performance (it's a bit complex but ask a SOTA AI to guide you and it's not hard). The harness should also react more intelligently to failures; it can do things like return additional context or hints as it tracks error rates and avg duration of calls. Pi can be easily extended, and it's suggested by the author you modify it to perform better for your use case.
awllau 5 hours ago [-]
Based on your explanation, it doesn't sound feasible for me, a complete non-engineer, to switch to fully offline? I do a lot of back and forth discussion with LLMs as someone who reads and writes 0 code.
spullara 8 hours ago [-]
This is the only setup that I think is reasonable to use locally right now. I had an agent set it up for me from this guys recipe:
One thing I did change was the context length to 256k rather than 64k.
hparadiz 12 hours ago [-]
I am right there with you. Mind-boggling. It's a indistinguishable from magic technology!! I tried running some basic tasks through Qwen with Opencode on a 10 year old dual Xeon server for shits and giggles. I gave it a simple task like "use ffprobe first but convert this webm to mp4" and it was able to complete the task with zero network calls outside my network. On 10 year old hardware. It took about 3 minutes to complete the task. Now you may be saying 3 minutes? pfft. But I dare you to do it yourself. You're gonna be googling the CLI switches for at least 10 minutes and setting up your command. I had it actually optimize all the switches on the fly for me based on an initial ffprobe to see what is optimal.
bandrami 25 minutes ago [-]
> You're gonna be googling the CLI switches for at least 10 minutes
So there's this really amazing program called "man"
bluerooibos 9 hours ago [-]
> 10 year old dual Xeon server...On 10 year old hardware.
Hold on, what are the specs of your rig? How much RAM?
I've been considering getting an old refurbished 2018 Mac Mini with 64Gb of DDR4 RAM but everything I've read suggests this will be way slower than my 16Gb M1 Pro Macbook.
hparadiz 9 hours ago [-]
I inherited a box with dual Xeons and 256 GB of DDR4. I then ran several tests and benchmarks of the hardware with several models.
I've been meaning to write a blog post but well whatever here's the md.
You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say a dual 7900XT would get way better performance.
jmuguy 12 hours ago [-]
Given your knowledge on this - do you think we'll see an open source model with Opus levels of capability? IMO if/when this happens - I would 100% stop using Anthropic.
Greenpants 12 hours ago [-]
Let me put it like this. I started with local LLMs when ChatGPT still used GPT-3.5. I was amazed how my MacBook with 8GB RAM could run openhermes2.5-mistral: a 7b parameter model that could generate short stories that sort of made sense. Incredible!
Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.
I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)
jmuguy 12 hours ago [-]
Haha well I ask because I don't really want/need anything beyond Opus most of the time. And I'm paranoid that Anthropic is going to be forced to charge the true cost of all this before too long.
Greenpants 11 hours ago [-]
The other upside of running local LLMs is that there's no cloud provider to suddenly charge more for the same, or even less, model use.
It's personal, but I prefer CapEx over OpEx for this. If you can purchase a device upfront that runs a decent local LLM, you get the peace of mind that your setup won't suddenly change over time and can only get better.
lambda 12 hours ago [-]
If you believe the benchmarks, Qwen 3.6 35B-A3B already outperforms Claude 4 Opus.
Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.
It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.
But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.
MrScruff 11 hours ago [-]
You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus.
lambda 11 hours ago [-]
Which Opus? They certainly outperform Claude 3 Opus.
Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.
mapontosevenths 10 hours ago [-]
There's a guy on Youtube named Bijan Bowen who tests all the models (open and frontier) on a series of one/few shot programming exercises and has been for a long while now. You can pretty much watch him compare the results for any two models you're likely to be interested in.
I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.
lambda 9 hours ago [-]
OK, it looks like he did a browser OS test with both Claude 4 Opus and Qwen 3.6 35B-A3B.
Qwen 3.6 produced far more working functionality than Claude 4 Opus did.
Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.
MrScruff 11 hours ago [-]
I’m normally comparing frontier open/cheap models against frontier closed source. I use deepseek/glm regularly, they’re fine and you can get real work done with them but it’s super obvious when you switch back to opus or even sonnet. A 3B active param MoE model is not comparable.
lambda 9 hours ago [-]
Yeah. I was pointing out that local 3b active models outperform frontier models from a year ago.
Will this trend continue? Who knows. Both the frontier and local model will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.
Hoever, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less that $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.
It's likely that there will always be a gap between frontier and local if you're comparing models at the same time, you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.
shimman 4 hours ago [-]
Agreed, but at their current prices Deepseek + GLM are clear winners in my book. This weekend I spent $5 between the two where as I'd probably have to pay $20-30 to Anthropic (and that's still with the massive VC subsidies).
For web development (or anything else with an extreme amount of training data) it's number one for sure. You can't beat it at its costs. US companies will not be able to compete on a competitive market, which is why they rely on so much US government protection + corporate welfare.
zozbot234 12 hours ago [-]
People can't seem to agree on what "Opus class" even means (the latest Opus is apparently pretty weak) but DeepSeek Pro, Kimi and GLM all are quite capable.
computerex 11 hours ago [-]
Nothing compares to Opus when it comes to "taste" in web design in my experience. Nothing compares to opus in very difficult HPC/model inference development. I worked on this with opus: https://github.com/computerex/dlgo
OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.
lambda 8 hours ago [-]
Which Opus?
Anthropic has been releasing models named Opus since 2024 with Claude 3 Opus.
Opus has gotten vastly more capable since then.
Local model far surpass Opus 3. They even surpass Opus 4 on most benchmarks.
Sure, if you compare to the latest Opus 4.8 or even 4.6, they're not there yet. But there's a huge difference in performance between 4 and 4.8.
jkells 7 hours ago [-]
Can't speak for anyone else but there was a step change in frontier models last November. Opus 4.5 and GPT 5.2 I think.
When I colloquially say Opus level I really mean Opus 4.5 or later
lambda 7 hours ago [-]
Right. Local models haven't quite hit that level yet. The biggest open models, which you need tens of thousands of dollars of hardware to run at reasonable speed, have pretty much hit that level of capability, but most models you can reasonably run at home aren't quite there yet. But given the gap, if local models keep improving, you'd expect to maybe see that level by this November.
zozbot234 57 minutes ago [-]
My understanding is that we could in fact run the largest models on "reasonable" home hardware by focusing on throughput rather than raw speed and having them do unattended inference in large batches. The big proprietary suppliers have no interest in this because their own incentive is to fill all the physical space available with top-performing hardware and doing huge amounts of inference as quickly as possible. A home user with limited hardware investment has very different constraints.
rvnx 12 hours ago [-]
To me totally yes, even further, if they keep their existing route, over time people will stop using Anthropic.
More and more specialized and ultra-performant chips are going to flood the consumer market. Especially once new hardware foundries will start producing (well if we don't die from WW3 in the interval).
In 10 years from now, when even basic computers will have 128 GB of memory, and phones will have super optimized tuned models, then what will be the point of Anthropic ?
Just use Gemma/Gemini/Siri or whatever.
Pornography and uncensored models is also pushing toward local models.
It's not like needs of people grows exponentially, the needs follow an asymptote instead (they are capped).
The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.
For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.
It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).
dotancohen 10 hours ago [-]
> you really need to know what you're asking, and be precise
Any chance that you could share some recent prompts to give other HNers a head start on his to approach Qwen? If you are uncomfortable posting them here, my Gmail username is the same as my HN username.
Thank you.
Greenpants 10 hours ago [-]
I'm glad you're asking. I already started writing a blog post on how to best make use of local models. I'll share it as soon as I have a complete enough list. If anyone else reading this would like to chime in with their tips & tricks, let us know!
For the time being, off the top of my head, I'd say:
- Prompt Engineering tips & tricks apply here (like being complete in the relevant context you provide in your question, and the specific task(s) the agent should do like reasoning, modifying one file, or trying to fix a complex task all at once (not recommended)).
- If you already know which files the agent should look into, mention them to save time and potentially context.
- In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.
- It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.
motbus3 11 hours ago [-]
Try deepseek V4 flash
nyxtom 11 hours ago [-]
Have you found that being much more spec driven helps guide it better?
timmit 8 hours ago [-]
I got a 48GB Ram MacBook, somehow I cannot even run a 20b model, I was suprised that you get 35b model locally.
klardotsh 7 hours ago [-]
4-5 bit quants would probably fit pretty well on your rig. Check HuggingFace for Qwen3.6-35B-A3B-MTP-GGUF [1]. They've also got a cool UI thing these days to help indicate which quants of a model will run on your hardware.
Full octane isn't gonna fit on much of anything south of a 128GB machine once adding KV cache.
it might be worth trying oh-my-pi in your case as it claims to improve the edit calls by using a unique patching format.
GardenLetter27 12 hours ago [-]
Could the harness not check for a failed tool call and pass it to a small model for correction without clogging up the main context?
lambda 12 hours ago [-]
The thing is, to do a proper fix it would really need all of the context (maybe the tool call that failed was for an edit to a file that was last touched way at the beginning of the context), so you'd need to either keep that smaller model running doing prompt processing all the time, or have a very long wait while it does prompt processing on your whole session.
And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.
Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context
everforward 10 hours ago [-]
An illustrative example I've seen a lot is creating Jira tickets in projects with custom fields marked as mandatory. It tries to create the ticket without the field and the tool call fails. The LLM needs access to the full context so that it can generate text to put in the "Why couldn't this meeting be an email?" field.
Greenpants 12 hours ago [-]
I'm actually quite sure that directly retrying the tool call would often fix the edit-call already. But these models have been trained to "think" for a while for any problem solving, so they'll presume the problem of the edit is more fundamental and spend unnecessary tokens filling up the context.
I'll experiment more with the effectiveness of AGENTS.md rules for local Pi agents. I feel like smaller (local) LLMs just lack in attentiveness to elements in the context window, like precise instructions, compared to e.g. Claude models.
amelius 11 hours ago [-]
Sounds super cool, don't get me wrong, but I suppose for most people the bar is higher than HTML/CSS.
nozzlegear 4 hours ago [-]
I use local LLMs on my Mac Studio to write and pass unit test suites in F#, among other boring project chores I don't want to do myself.
q3k 10 hours ago [-]
I love to warm up a whole rack of servers just so that some shitass buggy TUI can generate a line of bash that comments out my test runner.
We truly live in the dumbest timeline.
krainboltgreene 5 hours ago [-]
> is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture
I don't want to be rude, but your linkedin has a sumtotal (generous) of like 8 months of programming as a profession (job title is AI Engineer). The rest is at best programming adjacent. How would you know what either of these situations are really like?
SoftTalker 4 hours ago [-]
I haven't logged in to LinkedIn or looked at it since a former employer demanded that everyone create a profile. So mine is now about 20 years out of date.
krainboltgreene 4 hours ago [-]
His is very up to date. Not everyone is you.
yieldcrv 11 hours ago [-]
> It gets into loops quite often
matches my experience and a deal breaker
also the context window sizes are too low. I can't operate in 65,000 windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.
200k context windows and above for me now
I saw a paper last night that should help this a lot though
Greenpants 11 hours ago [-]
I get that it's a deal breaker to some; it definitely requires patience.
In Pi, /new is my best friend and most-used command for sure. For simple tasks (I decompose complex ones anyway since I don't trust small local LLMs to do this for me), the model doesn't need much context, given that I'm proficient in my codebase myself: "I'd like Feature X. Look into files 1, 2 and 3 to make your edits."
kennywinker 11 hours ago [-]
Qwen3.6-35b handles 256k context fine if you’ve got room for it. I’m running it with 128k context with just 16gb vram.
animanoir 3 hours ago [-]
[dead]
nobody_r_knows 12 hours ago [-]
[dead]
horsawlarway 13 hours ago [-]
For personal use, yes.
I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.
I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.
To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.
For my personal needs, free beats $100/m.
I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).
Some example projects
- Replacement launcher for android tvs (with usage monitoring and tracking for kids)
- Custom admin portals for my k8s cluster services
- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)
- Grocery list management and meal planning (mostly via openclaw)
- some custom workflows for 3d asset generation in comfyui.
---
Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.
rootlocus 13 hours ago [-]
2x RTX3090 are around $4400. Without any electricity costs or other parts, that's 3.6 years of $100/m claude.
overgard 11 hours ago [-]
Assuming the $100/m claude subscription is still around in three years.
booi 8 hours ago [-]
we will be lucky if it's still around in 3 months..
reddalo 10 hours ago [-]
[dead]
oofbey 1 hours ago [-]
I think there’s a reasonable argument that a burst bubble will cause prices to drop. Prices are very high because they’re trying to justify these trillion dollar valuations on IP alone. If that fantasy goes away then prices will fall down to just silicon and electricity, which looks more like Chinese model prices. Hard to say how it will play out but the direction isn’t obvious to me.
horsawlarway 13 hours ago [-]
Yes, today is not a great time to purchase hardware.
When I bought, I paid $850 a piece. And I needed one anyways for the gaming I was going to do.
My guess is the next good time to buy is going to be 24-36 months from now, depending on how the AI bubble goes.
---
I'll add to this, I personally don't like Apple hardware (not so much related to the hardware as their company philosophy) but their machines with unified memory (or AMDs latest unified memory offerings) get pretty equivalent speeds to my 3090s, and are probably a much better modern entrypoint to local llms.
There's a reason the joke is that Silicon Valley software devs bought up all the Mac minis for OpenClaw.
You can get a 48gb unified RAM M4 pro mac mini for ~2k. If you're not going to do much else with the machine, it's what I'd pick as my budget inference device right now. Spend a year of claude now, get ~150tok/s for the next decade (plus) for ~free.
If you want more capable and are willing to spend a little more, go with the newer Ryzen AI Max+ 395 machines.
You'll spend less on power too.
My last suggestion would be to go buy an RTX3090 at this point. You can do a lot better for a lot cheaper.
tracker1 10 hours ago [-]
If you're willing to go the AMD route, the AMD Radeon Pro R9700 definitely looks interesting for the price compared to NVidia.
felooboolooomba 9 hours ago [-]
Can we also run LLMs on Radeon?
lloyd-christmas 4 hours ago [-]
I run qwen 27B:Q4 @ 130k context at 50 t/s on a single R9700, and have a 7900XT that runs mellum 12B:Q8 as its subagent. R9700s do really well at low wattage and underclocking as well. It's designed to run at 300W, mine is throttled at 210W, and only had an 8% slowdown. If I had somewhere else to put my desktop in my house, I'd bump it up to 240W and there would be zero perf degradation.
freetonik 12 hours ago [-]
That's also years of top tier PC gaming, if you're into that.
augusto-moura 12 hours ago [-]
2x RTX3090 is extremely overkill for gaming, you can run any released game on earth on ultra for much less
davkan 36 minutes ago [-]
There is currently no gpu in production that can max out the largest and fastest displays in graphically demanding games. We have monitors that are the equivalent of two 4k monitors side by side and run at 240hz. I have a 5080 and have to turn down settings to get 60fps in cyberpunk.
drnick1 11 hours ago [-]
1x RTX3090 is absolutely not overkill for gaming however. Nowadays it's barely enough to get 60FPS in 4K in some recently released games. But the shocking part is that my 3090 is still probably worth as much as when I bought it about 4 years ago.
arcanemachiner 8 hours ago [-]
It's probably worth more now.
overgard 11 hours ago [-]
Having a second card doesn't really work well for gaming.
lowbloodsugar 6 hours ago [-]
I can’t run 4k HDR cyberpunk 2077 at 240hz with path tracing. I’m managing ~120fps. I’ve got a Blackwell 6000. I didn’t buy it for games, but there are still games and setups where the GPU is the bottleneck. I don’t even have an 8k TV.
googletron 12 hours ago [-]
what?
kakacik 12 hours ago [-]
AFAIK nvidia cards dont work in tandem (aka sli in the past) very well these days. So that aint true.
Also, 2 gens old means bad performance at ray tracing, abysmal path tracing if at all. Pretty sure it can't run smoothly CP2077 in native 4k without dlss upscalers with all on ultra.
himata4113 12 hours ago [-]
You can have the 2nd card as an offload for upscaling, frame generation and whatnot.
irishcoffee 11 hours ago [-]
When I'm not running models I use the 2nd one in a pass-thru configuration to a windows vm for various things, usually gaming.
driverdan 7 hours ago [-]
If you pay $2200 for a 3090 you're a sucker. They're not worth anything close to that.
jmuguy 12 hours ago [-]
Or a really excellent experience playing Satisfactory with the settings cranked up, which is priceless.
fluoridation 2 hours ago [-]
Look in the used market, not new. There must some that can be had for much, much less than that.
matheusmoreira 8 hours ago [-]
Those GPUs can also play video games or mine cryptocurrency. They can also be sold later.
We should own things, not rent them. We should all do what we can to keep the fabled 2030 agenda at bay.
tripleee 12 hours ago [-]
Christ GPU prices have gotten crazy
How do AMD cards perform with LLMs? A 9070 is sold for ~$600 and has 16GB VRAM
overgard 11 hours ago [-]
In my personal experience, I wouldn't bother with 16GB cards for coding -- the useful models are _slightly_ too large to work at any reasonable speed
lambda 11 hours ago [-]
That should do pretty well. Memory bandwidth is the biggest bottleneck for token generation, at 644 GB/s you should be able to do pretty well on a 9070, while prompt proessing is more compute bound and Nvidia tends to have the edge there.
16 GiB won't fit you much, so you'd probably want at least 2x, and preferably 3x of those, and then you need a motherboard, power, etc. that can handle that.
tracker1 10 hours ago [-]
You can get an R9700 with 32gb vram for ~$1200-1400 depending on where you live, which is probably a better option for AI use than 2x 9070(xt)
lambda 9 hours ago [-]
Yeah, definitely.
nyrikki 13 hours ago [-]
You can get 60tps with three 1080tis and the sparse model, and I bet two 16gb 5060tis would do the same for ~1200. One 3090 is enough for a useful system, even on an old am4 host.
flowerthoughts 12 hours ago [-]
In 3.6 years, chances are they are still worth $3k. Unless some new chip fab pops up that can spam the chip market. Even if the AI bubble bursts, I doubt we'll see high-RAM GPUs sell off.
sieabahlpark 12 hours ago [-]
[dead]
kpw94 12 hours ago [-]
> gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models
Since you're running quantized (at UD-Q4_K_XL) , check out the "qat" models (unsloth/gemma-4-26B-A4B-it-qat-GGUF) !
> Quantization-Aware Training (QAT) [...] allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model
SubiculumCode 9 hours ago [-]
How is the the QAT models at coding? I looked for opinions since the release and haven't found much.
twothreeone 12 hours ago [-]
> unsloth/Qwen3.6-35B-A3B-MTP-GGUF
I've actually tried this exact same model locally as well.. albeit on just a single 3090 at 128k context and I got around 40-60tok/s with Q4_K quantization.
The thing that bugged me the most was really the quality of the output on moderately complex real-world coding tasks. Having to switch between "prompt/vibe" and "manually implement" is such a big context switch burden, because you really have to ask yourself every few minutes if you're "holding it wrong" or the model is just too stupid.
It also doesn't really seem to handle transitions from "low-level implementation detail" to "high-level design" well, e.g., it wouldn't easily render tables and such. With Claude I don't have this issue.. so I think for now my verdict would be that it's not really a viable replacement. I really hope it will be in a few months time.
Oh and I used "aider" to replace claude CLI, which maybe that's also sub-optimal.. I'm not sure. The MCP marketplaces are useful of course, though arguably you could just manually replace them over time.
horsawlarway 12 hours ago [-]
I don't generally switch to implementing myself on the model, although there are definitely times where I stop it and correct it mid-task.
It's prone to thinking longer and more repetitively, again - it's definitely not opus 4.7/4.8.
I've been using pi.dev as my harness for it, and been pleasantly surprised by how nice it feels (I have used aider, but only very briefly and quite a while back - so I can't realistically compare).
I would say it's roughly where I felt claude was a year back - Most of the sessions need to be more "pair programming" and less "I let it run for hours".
I'm a big fan of frequent "human in the loop" style workflows even when I'm on something like opus at work, though. I have opinions about lots of things, and re-inforcing that the model should stop and ask frequently seems to get me considerably better output, without having to "re-roll" if you will.
I've done a good bit of management, and I think it's roughly producing what a junior dev might produce in a day every 5 minutes. And just like a junior dev, you need to be steering it back on track fairly often.
Opus feels more like a mid-level at this point. I can hand it a chunk of work and "leave" but I still get better output if I'm checked-in and watching/steering.
unethical_ban 11 hours ago [-]
I'm so out of the loop on this stuff, it's the first time in my IT career I feel really behind on things.
I've used Claude Opus to quickly and effectively pound out some 100-200 line scripts that integrate with a vendor's API, and it one-shotted them both almost perfectly.
I wonder if for a lot of these local models, the scope of the AI assistance should simply be smaller: You architect the tools and the function definitions, and then tell AI to implement one at a time? Does anyone do that rigorously?
gonzalohm 13 hours ago [-]
Did you double the tokens per second by adding a second GPU or was the increase significantly less?
horsawlarway 13 hours ago [-]
No real change in inference speed. It basically just allows me to slot in more context or a bigger model.
A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.
Sometimes that matters, a lot of times it doesn't.
On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.
I get very similar quality from the both the Gemma-4 31B dense model, and the Gemma-4 26B MOE model (both at Q4 quant) but the MOE version runs at ~3 times the speed (150tok/s vs 46tok/s).
mirekrusin 13 hours ago [-]
You’re adding extra gpu for more vram, not speed.
agup792 13 hours ago [-]
That sounds amazing. If I had some GPUs sitting around, I would totally do it. Sounds expensive to do it otherwise though.
anhtqweb 10 hours ago [-]
Grocery list management and meal planning sounds interesting. Would you mind sharing a little bit more on your use case please?
ljosifov 5 minutes ago [-]
Not replaced but supplemented. For off-line coding current setup is pi + ds4-server + DeepSeek-V4-Flash REAP25 (on M2 Max 96gb). For simpler programming related (e.g. text2sql) as well as synthetic data generation, current best for me is llama.cpp + Gemma-4-26B-A4B (on gpu 7900xtx 24gb; sometimes nemotron-cascade-2-30b-a3b for 1M context). That and (dabbling now) auto-research uses lots of tokens. Used to get paused running out of token quotas all the time. The 1st local model I found somewhat useful to me was glm-4.7-flash, and it's gotten way better since. Recently between OpenCode Go choice of models at many price points, and DeepSeek-V4 dropping the IQ/$$$ by multiples, have become less reliant on local llms for this auxiliary work. Claude I use but with Zai GLM-5.2 subscription. And maintain GPT subscription for quality models.
bluejay2387 13 hours ago [-]
About 90% of my coding is on Qwen 3.6 27b and Open Code with some custom skills and Semble. It is NOT as smart as CC or Codex but its enough to get most of my work done. I didn't set out to replace CC and Codex (I have an RTX 6000 so the TPS is faster than I care about, but the RTX 6000 was originally for other work). I only tried this just to see how close you could get to a frontier model for coding as an experiment, but it was good enough that I stuck with it. I still fall back to Codex for really complicated stuff and to polish UI's as that seems to be the weakest element to working in Qwen.This isn't a recommendation because I don't think most people have an RTX 6000 laying around and the cost would be many years of MAX CC or Codex subscriptions, but at least this seems possible. Maybe in a few more years it will even be practical.
Other Notes: I have had to set the compact target to 75% on a 256k context window as once the conversation length goes about 100k I start seeing a drop in the quality and speed. This becomes very problematic after about 150k. I tried Qwen 3.5 122b too but it actually seems much worse at coding than 3.6 27b even though its much larger. Maybe because I am using a 4bit quant or maybe I just don't have it configured correctly? I know 3.6 is newer but I didn't expect it to out perform a model that is much larger from the prior generation. Gemma 4 31b is a good model for other tasks but at least my personal experience is that Qwen outperforms in coding. Nemotron Super 120b is great at a lot of stuff but it also seems to be not as good at coding as Qwen. This was very surprising to me.
heipei 12 hours ago [-]
Same here, I use Qwen 3.6 27b (Q6 quant) with llama.cpp on an RTX 5090 using the pi agent exclusively now. The fact that it's local means that I never have to think about token pricing, quotas, time of day, or data sensitivity. I have limited the GPU from 600W to 450W which means the system stays whisper quiet during inference.
I have become so "lazy" (in a good way), so far that I've started using the model for lots of daily mundane things on top of just coding:
* "commit this on a branch, push, create a PR and assign $nickname for review"
* "Use the Stripe CLI to download all open and overdue invoices and reconcile them with this CSV export from our bank account."
* "Use these Elasticsearch credentials to summarise what kind of operations are causing load at the moment."
* "Tell me if our codebase already supports X and where it's implemented."
amarshall 9 hours ago [-]
What context length and kv cache quant (if any) are you using? And MTP?
lloyd-christmas 3 hours ago [-]
Not the person you asked, but I have a 9700 which has the same VRAM, and running Q6 on it with unquantized kv gives me 50k context. Putting -ctv q8_0 ups that to 70k. I normally run Q4 with unquantized kv @ 130k at 50 t/s (mtp 3), with the disclaimer that I'm running PCIe gen4x8, so I'm slightly slowed. I've found that quantizing k leads to broken json on tool calls, which is fairly unrecoverable, but YMMV.
bo1024 13 hours ago [-]
Qwen3.5-122B is actually Qwen3.5-122B-A10B. The A10B means that this is a "mixture of experts" model where only 10B parameters are activated at a given time. Whereas Qwen3.6-27B is a "dense" model where all 27B parameters are activated all the time. So for many tasks, you'd expect the 27B dense model to be better than the 122B-A10B model.
user43928 10 hours ago [-]
I am forced to use Qwen 3.6 27b at work and found it next to useless.
I might as well do all the work manually rather than having it implement another mess or get the debugging entirely wrong.
It feels like anything less than Sonnet is just a waste of time, apart from use as a smarter search function.
It also strikes me as strange that you would mention Codex for UI polish, as it's notoriously bad at UI, and far behind Claude Opus. Altman specifically posted that they are working to improve this for the next model release.
sejje 10 hours ago [-]
It might be good at analysis & review, writing documentation, git commits, etc--even if it's not good at coding.
All the drudgery.
user43928 9 hours ago [-]
Bad AI written documentation and commits are not great, particularly when you work in a team.
I almost find it offensive when colleagues open a MR with an obvious slop description that's frequently inaccurate.
That said, I find AI useful for a lot of drudgery like resolving merge conflicts or splitting changes out into separate MRs.
Particularly with the latter I had issues with small models, they butchered the changes I wanted moved. Not even on the second attempt did GPT 5.4 mini manage to move 10-20 lines to another file without modifying them in the process.
htrp 13 hours ago [-]
why 27b vs 35b? Is MoE that much worse for coding?
amarshall 9 hours ago [-]
Can take the geometric mean of total and active parameters of MoE to get approximate equivalent quality to dense model params. So sqrt(35*10)≈18.7.
The trade-off of MoE is that it is worse but faster for the same total size.
electronsoup 11 hours ago [-]
Yeah MoE is a little worse for the same size, but you can often run bigger MoEs at respectable speeds even on cpu ram offload. The dense models really need to be 100% vram
codinhood 14 hours ago [-]
I don't think you're going to get many "true" answers to this. The opportunity cost of not using the latest and best models is just too much right now.
Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus just not worth it right now. If it was, it would be distributive enough to be in the news.
Not that I'm discounting someone hasn't already solved this, just trying to Occam razor my way out of diving too deep down rabbit holes.
pyeri 13 hours ago [-]
At some point, there will come a saturation point for that "Opportunity cost FOMO train ride", and I think we are already past that point. Mythos class models are a whole different beasts and cutting edge on reasoning but not much use for the problem domains most developers are trying to solve.
The present Sonnet/Opus versions (~4.8) will likely be what everyone in the enterprise might end up using eventually. And even though local models aren't there yet, there are budget alternatives from the families of DeepSeek, Kimi, GPT, MiniMax, etc. available through APIs of NVidida, OpenRouter, Groq, etc. which are very much Sonnet grade.
codinhood 12 hours ago [-]
Yeah this is exactly what I'm waiting for.
Personally, I don't think we're at that point yet. While I do think model improvement is starting to plateau (reaching a local ceiling), I'm not convinced local models are as good as sonnet/opus yet. The gap is still too much. But I'm excited for those models to reach those levels.
kristopolous 38 minutes ago [-]
that's super contextually dependent. I use them just as essentially a decompress of what I already know that I'm doing. I legitimately use 4B models just fine. I've got a large number of tools that make this entirely feasible and a daily driver for me (like https://github.com/day50-dev/llm-manpage-tool) ...
It's not really a bitter lesson here, I can scale those 4B models easier than someone can scale their 1000B models.
phyzix5761 5 hours ago [-]
The opportunity cost to who? Its getting super expensive for businesses and engineers across the board to pay for frontier models.
mark_l_watson 11 hours ago [-]
Sounds like a correct conclusion to me also. I am trying to transition to a layered system: local, then OpenCode with commercial vendor APIs for models like DeepSeek v4 flash, then DeepSeek v4 Pro.
With a layered approach we can slowly shift to running more locally and still get required work done. Really, my local setup is so much better than it was 2 months ago, and extremely better than 6 months ago - on the same hardware.
sakopov 12 hours ago [-]
This seems to be the answer. Building a rig with a decent graphics card will cost $2k+ and will produce sub-par results. Might as well milk the $100/m Claude sub until open-source alternatives reach parity with today's frontier models.
gunapologist99 9 hours ago [-]
Rather than Occam, consider Pareto?
If you truly believe that it WILL get there within the next couple of years, then you might as well start playing with it now (and, yes, you will be very surprised, especially for shorter/smaller projects or nicely modularized larger projects)
MadrasThorn 12 hours ago [-]
It's great at accelerating hardware innovation however.
jrm4 13 hours ago [-]
But you're pretty much measuring opportunity cost in tokens per second, no?
I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by percieved quality of private model) actually means "better or more useful output."
I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)
codinhood 13 hours ago [-]
If you’re arguing that model metrics don’t necessarily translate into useful output, I agree. That’s not how I measure the success of a mode and not really the point I'm trying to make. I try to set things up and test it on my actual projects.
What I’m saying is that if local models were actually comparable to Claude Code in practice, we wouldn’t be having threads like this. It would be obvious to the people using them, and it would be massively disruptive. Why would individuals and companies pay hundreds or thousands for Claude Code if they could run something locally and consistently get similar results?
Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.
jrm4 7 hours ago [-]
Having, e.g. seen Microsoft maintain a monopoly for well over a decade, there's nothing in my experience that suggests that "quality always beats hype" is remotely true.
It's entirely possible Claude is just winning the hype game.
Rastonbury 13 hours ago [-]
I think they are referring to the opportunity cost of time saved on doing things a local model cannot do or fixing it's mistakes against the cost of a subscription
pierotofy 14 hours ago [-]
Yes. Llama.cpp + Qwen3.6-35b (MTP) + OpenCode is quite capable and runs on a single RTX 3090 and is faster than most cloud models. Quality is like running edge models from 8-12 months ago. Setup details at https://github.com/pierotofy/LocalCodingLLM/
jacobgold 13 hours ago [-]
"Quality is like running edge models from 8-12 months ago."
That sounds great for hobbyists but IMHO it wasn't until Opus 4.6 was released six months go (Dec 25, 2025) that we had a model good enough for professionals to use as a primary driver of their coding agents. That seems to be the threshold worth aiming for.
kelnos 2 hours ago [-]
Not sure what you mean by "primary driver", but I was finding even Sonnet quite useful for coding tasks, even about 12-14 months ago (I was too cheap to pay more than $20/month back then, and Opus hit my limits too quickly).
Certainly I get a ton more value out of Opus today, but I could absolutely see someone deciding to limit themselves to 8-to-12-months-ago Opus performance for privacy (or other) reasons.
sbrother 13 hours ago [-]
I strongly agree on that being the release where these tools got good enough to substantially speed up my professional work. I have to admit I was super skeptical of AI coding until then.
deaux 11 minutes ago [-]
Your skepticism led you to underrate the usefulness until then. Those who have been using agentic coding for the last 2 years can tell you Opus 4.6 was not a step change in quality, it was mostly a step change in the Overton Window and narrative.
dnautics 12 hours ago [-]
for me (might be because of the language im using) i had a substantial bump around september and a huge bump around January.
in my stuff now i use an OT library that claude put finishing touches on in September.
storus 7 hours ago [-]
You can already get Opus 4.6 level of performance on subtasks with some local models. So you need to pick a proper code writer, plan writer, code tester etc. model that matches your target expectations and use a coding tool that allows calling different LLMs for different subtasks. For example, people use StepFun 3.x or DeepSeek4-Flash for planning, Qwen3.6-27B for coding.
alexandra_au 4 hours ago [-]
You have your dates and models wrong, it was Opus 4.5 released in November 2025, that changed everything, Opus 4.6 was released in February 2026.
jacobgold 4 hours ago [-]
You're right. December is when things felt differnt but Opus 4.5 was actually released November 24, 2025.
So thalen it might be 6-8 months to get to useable on a local open model? Of course state of the art will be a year ahead, a generation at the current pace.
pierotofy 13 hours ago [-]
I use it for work.
jacobgold 13 hours ago [-]
That's cool if you prefer it, but it is hard to imagine it being a strictly rational choice when much better quality is available at a price that is small relative to the cost of an employee. Or is there something specific about your use-case?
vector_spaces 13 hours ago [-]
Not all work requires every facet to be so sharply optimized, and there may be other constraints that are completely invisible to you. Some that were easy for me to imagine: the parent works in a heavily regulated industry, their IT team is slow-moving and paranoid and this is a safe, under-the-radar workaround, the output is "good enough" for their purposes and they find tinkering with it to be fun.
Regardless I don't think it's fruitful to be so condescending with such little insight into this person's situation. Even if you had total insight -- let people be and withhold your judgement, or at least keep it to yourself. Making people feel stupid is a great way to turn people off to pretty much anything else you have to say
pierotofy 12 hours ago [-]
To me, what's not rational is believing you must rent the tools of your trade while exposing all of your employer's intellectual property to a third party. Difference of opinion.
jacobgold 12 hours ago [-]
It's not my opinion that you "must" rent tools but it certainly is the pragmatic choice in 2026. I would be as happy as anyone for this situation to change and I expect it to at some point.
lokar 13 hours ago [-]
Won’t it depend on what you use it for? A less capable system might be fine for boilerplate, moderate re-factoring, etc. Not everyone is building whole features in one go.
epolanski 7 hours ago [-]
Why don't you people bother to try instead of chasing the latest shiny thing?
You must be the type of crowd that writes websites with React and Tailwind and pretend to be engineers and have an opinion on everything.
trueno 13 hours ago [-]
i have a 128gb m4 max macbook pro i've been wanting to tinker with this stuff but genuinely never find the time. any mac users in here running similar to the above that can share their experience?
i always see great debates with local stuff but the space is constantly moving goalposts and all the vernacular is pretty unfamiliar to me. i'd love to understand what people with objective experience feel they've traded away (or gained) when going local so i can determine for myself if these things are a good fit.
brycesub 13 hours ago [-]
If you have a 128GB Mac you really ought to try out: https://github.com/antirez/ds4 by the creator of redis. This is probably as close to it gets to state-of-the-art local LLM + agentic coding.
__mharrison__ 10 hours ago [-]
Using this just this morning on my DGX Spark. A little slower than frontier models but my $200/mo weekly usage exhausted with 3 days left on the week...
(Shouldn't have done that refactoring job in high mode)
trueno 10 hours ago [-]
well this is supremely interesting thanks for putting it on my radar
lostlogin 12 hours ago [-]
Thank you.
htrp 13 hours ago [-]
Use your ClaudeCode sub and tell it to set it up for you
dirkolbrich 10 hours ago [-]
I have the same machine. You might look into https://omlx.ai/ a „macOS-native MLX server“. pi.dev for the agent with MCP, web-search and sub-agents extension.
atomicnumber3 13 hours ago [-]
Same. I have no desire to use Claude at all anymore.
pierotofy 13 hours ago [-]
Yep. Screw Anthropic, CloseAI and all other rent seekers in this space.
akulbe 12 hours ago [-]
I have an M2 Max MBP with 96GB of RAM. What models and setup would you use for this kind of configuration?
monirmamoun 11 hours ago [-]
download LM Studio to play with, and it will let you search for models... try Qwen3.6-35B-A3B at 4,5 or 6 bits (6 bit XL is near perfect) and use pi coder or another harness to access it... you can also try Unsloth studio and try same model to start. LM Studio slighter easier to use, Unsloth probably better quality. Neither one is super great quality by the way (meaning: they crash or act weirdly too often to be full production solutions, but can work for local coding). ONCE YOU DOWNLOAD EITHER APP... it will let you search huggingface for the models. Just type qwen to start looking and ... start messing around. And you connect the pi coder harness using the http interface that LM Studio and Unsloth offer to the engine API, so make sure you figure out that url and turn it on... something like 127.0.0.1:1234/api would be a typical IP (localhost) and port (1234 is used by LM Studio)
daveidol 13 hours ago [-]
Do you do your dev work on the windows machine (referenced in the docs), or do you remotely access it from a separate machine? I ask because I have a RTX 3090 kicking around in a gaming desktop, but I don't use it for any dev work (I use a Macbook Pro).
snake_n_my_boot 11 hours ago [-]
I have a similar set up and have been using it to learn and tinker with open models. I run Ollama on the gaming desktop and point OpenCode to it from my MacBook. Works nicely for me so far.
lelandbatey 13 hours ago [-]
I use it, it's good, I get work done, but know that they really mean it when they say
> "Quality is like running edge models from 8-12 months ago"
Don't expect Opus, expect more like Haiku. If you micromanage it, you'll get great results. If you want it to be a human in a box, it'll flounder.
dheera 13 hours ago [-]
Am I doing something wrong or has ollama become shittified?
I'm looking at https://ollama.com/search and the top few models like kimi-k2.7-code say "cloud" and I can't seem to ollama pull them.
I thought the whole POINT of ollama was not-cloud?
satvikpendem 13 hours ago [-]
Ollama is not recommended to be used. Use llama.cpp.
The larger models are available on Ollama's cloud as most folks don't have the hardware to run 500B-1T parameter models.
jubilanti 11 hours ago [-]
> I thought the whole POINT of ollama was not-cloud?
It was at first, then the developers realized they had a massive userbase they could monetize. A tale as old as open source...
toyg 13 hours ago [-]
Yes, you've nailed it. Ollama are desperately trying to pull a Cursor - like 3791 other projects in this space.
dominotw 13 hours ago [-]
how much does the setup cost if i want to buy all the hardware now and increased power costs?
goranmoomin 1 hours ago [-]
I'm not using my models locally, but the majority (80% or more) of my coding agent sessions run on open source models, i.e. DeepSeek v4 Pro and Kimi K2.6 with thinking.
A point that I haven't seen come up a lot, but is very valuable to me is that for open source models, I can select the inference provider myself (even if it's not a local GPU), which means that I can enjoy superb speed (i.e. 300 tok/s) while still spending much less than the big providers.
My experience is that if you were fine with the coding models of yesterday (i.e. Claude Opus from Jan/Feb of 2026), you will be fine with either Kimi K2.6 or DeepSeek v4 Pro. Kimi is a bit more smart but has only 256K context and the performance deteriorates (and sometimes just gets stuck) when it fills up the context window. DeepSeek v4 has a 1M context and performs just as well with much less issues. And they both generate very idiomatic code, gives the same vibe of Opus a few months ago.
Since it's also fast (and does not fixate on trying to fix impossible problems, unlike the recent Opus/GPT 5.5 models), a big benefit is that you still control and steer the coding agent and you won't be losing focus like the major models. They are smart, but they don't fixate as much on trying to do stupid things, and since it's fast, you can just interject. It's a much more pleasant experience than the latest models.
I still use the latest models time to time when I expect the agent to fixate all of the problems and figure out everything themselves, but for me open source models are like 80~90% of all of my sessions.
sosodev 14 hours ago [-]
The problem with this question is that it encompasses a huge spectrum of capabilities and expectations. If you can only run an 8B model and expect it to be good at vibe coding / one shotting things you're going to have a bad time.
If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.
If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.
The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.
argee 13 hours ago [-]
I use Gemma 4 26B A4B on my Macbook (M4 Pro, 48 GB RAM) to study Rust (and ask other myriad questions). I don't trust it to do a good job in an IDE/harness to one-shot anything but the most trivial of changes. Still, it's fast and good enough that it could handle being a "co-pilot" on small to medium context tasks where you've got your hands on the wheel and your eyes on the road — and are driving under the speed limit. That's remarkable given where we were a couple of years ago.
I don't think I'd be using AI to code at all if this weren't the case. (I don't want to feel stunted or stuck just from losing my internet connection.)
user43928 10 hours ago [-]
My experience with smaller models, in this case specifically GPT 5.4 Mini, is that they cannot two-shot moving a 10-20 line code change to another file without modifying it and introducing bugs.
I did not expect perfect reliability, but I thought they could at least get it right on the second attempt once you point out the difference. No such luck, it confidently tells you that now the code is the same, with yet another subtle bug added in the difference.
I don't know what work one would need to do where these garbage-class models would be adequate. Maybe they can masquerade as competent for a few minutes, but in the end the results simply are not right. At best they are suitable for a smarter search or autocomplete, in my opinion.
what 6 hours ago [-]
Is it not faster to just do that move yourself instead of asking the clanker to do it?
nake89 52 minutes ago [-]
I have an RTX 4060 12gb vram. Qwen3.6 35b. I stopped paying for Github Copilot. But I wouldn't say I replaced frontier models with a local one. I still have some dollars in my openrouter when I need to. Also to get interactive agentic coding speeds I need a high tps. So my quant is very small. And I would say a coding harness that is fully extensible is a must to create fully custom workflows tailored for low specs. I use pi (not perfect, still found some hard coded, non-extensible parts)
Kostic 13 hours ago [-]
For personal needs I connected VSCode with llama.cpp running Qwen 3.6 27B or Gemma 4 31B and it's good enough to cancel my cloud subscription.
Qwen running on my 1st GPU at q4@176k context from 70 to 50 tok/s with MTP, pretty good for coding.
Gemma on the other hand is using both GPUs, running q8@64k context, doing document sentiment analysis, summarization, proofreading and translating, at consistent 25 tok/s. Somewhat slow but usable for batched workflows. Might get some more once llama.cpp starts supporting MTP with tensor split mode.
Still using frontier LLMs at dayjob since I'm not paying it and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.
EDIT: Prompt processing starts from 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k of tokens and require from 60 to 90 seconds to be processed. Not great but acceptable.
fitzn 5 hours ago [-]
What extension do you use in vscode to connect it to local llama.cpp? Or do you auth with github copilot and then point to localhost? Or something else?
Not “local” and not interactive coding but sharing since it might be helpful. I have 2x RTX Pro 6000 Blackwell running DeepSeek V4 Flash. I get 160 tok/s raw but it’s a reasoning model. For my use case, I have it auto-write code and another system auto-review the code.
I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.
akersten 13 hours ago [-]
> I have 2x RTX Pro 6000 Blackwell
Where did you find/order these? All the sites I can find are either out of stock, only sell to businesses, or are otherwise sketchy...
arjie 10 hours ago [-]
I run a small business (https://technologybrother.com) that runs a few small SaaS so I ordered the GPUs through corporate sales. If the barrier is getting an LLC, those are relatively cheap. The nice thing is that if you've got a legitimate business with use for GPUs you can get into the Nvidia Inception Program which has a pretty solid discount.
zackify 6 hours ago [-]
Microcenter is the easiest place but almost any vendor will sell to you after you email them and if you have an LLC
No affiliation, I've just ordered from them a few times.
leptons 14 hours ago [-]
Have you measured your electricity consumption for this rig? I have to wonder how much it would cost you per month.
ux266478 13 hours ago [-]
Not nearly as much as you might think. 1.2kw where I live translates to about $0.12/hr, and that's when running full clip. If you have a decent solar hookup, it's small fraction on a sunny day.
The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.
leptons 7 hours ago [-]
I'm paying about $0.19/hr and using half that power just for a large spinning RAID, running some VMs and security cameras. And I'm reconsidering my digital extravagance because of the electric bill. You probably make way more money than I do.
mtone 9 hours ago [-]
Here's a DeepSeek-V4-Flash benchmark on 2X RTX Pro 6000:
- Prefill: ~10K tok/s
- Decode: 190 | 375 | 980 tok/s (for 1 | 4 | 16 concurrent requests)
- GPU power draw during benchmark: Average: 585W | Max: 849W | Limit: 1200W with undervolt. Idle PC is 125W.
I've asked it to calculate the following considering a realistic blend of cached prompts and decode for agentic dev scenario.
Electricity-only (@ USD $0.08/kWh)
Usage | IN price | OUT price | Monthly cost
Concurrency=1 | $0.040/M | $0.080/M | $8.65 to $38.88 (5% to 100% active)
Concurrency=4 | $0.024/M | $0.044/M | up to $48.67 (cheaper per token but higher power draw)
Total cost of ownership over 3 years is electricity + USD $20K (pre-hike pricing). In a production scenario, how much would I have to charge my users to break even, aiming for 4 concurrent requests 24/7?
A) Breakeven API pricing (est. 2B IN + 1B OUT throughput/month):
IN price OUT price
Self-hosted $0.121/M $0.363/M
OpenRouter (budget) $0.098/M $0.196/M
OpenRouter (DeepSeek) $0.140/M $0.280/M
B) Breakeven subscription (users active ~1.5h/day):
Vouched your comment. Very cool. What are you running on to get 190 tok/s? I get 400 tok/s at c=4 but c=1 is slower than you.
mtone 3 hours ago [-]
I am using the `voipmonitor/vllm:lucifer` docker from the RTX6K discord community discussed at the same link the other commenter posted. It is based around this PR https://github.com/vllm-project/vllm/pull/43477
There may be a way to get the 2-bit quantized version running even faster on a pair of them.
stymaar 13 hours ago [-]
Yes, Qwen3.6-35B-A3B on a Strix Halo 128GB (Bosgame M5).
I have way too much VRAM forme such a model but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible, this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.
And Qwen3.6 has been surprisingly effective for me, I still use Clause occasionally but only for like 10% of my needs which allows me to stay well under the quota even with the cheapest plan.
Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).
manmal 13 hours ago [-]
Have you tried the 27B dense version? It’s way better for coding.
anana_ 12 hours ago [-]
Unfortunately on Strix Halo or any similar unified memory set up, dense models are gonna be dirt slow due to the tiny memory bandwidth... But I agree, 27B is superior.
stymaar 12 hours ago [-]
Exactly. That's why I'm disappointed there wasn't a 122B version, it's 27B but for Strix Halo users.
garethsprice 11 hours ago [-]
Using OpenCode + OhMyOpenCode + Qwen 3.6 35B-A3B Q_4_KM on an Ada 4000 (20GB VRAM) at 55 tok/sec for generation (slower than it sounds as OpenCode has a bunch of context it adds). Meaning to check out pi when I get a minute as I hear that one mentioned a lot lately.
I am using Opus to generate plans that the local agent then follows, then validated by Opus. So I'm not at 100% local but these models are increasingly part of my production workflow. Probably not worth doing - yet - unless you are a hobbyist who likes spending time and money tinkering.
This setup is certainly not as "good" as Opus or other frontier models but they are "good enough" for an increasing number of rote tasks. You don't need to drive a Rolls Royce to the supermarket, when a used Corolla gets you there just fine.
It also enables new workflows that would be cost-prohibitive with frontier LLMs (especially as token costs rise) - eg. overnight I use the Chrome devtools MCP and have the above setup fuzz-test as a user for a number of hours and see if it can break things. Even got it working with multi-modal so it can check screenshots, which blows my mind (and not my wallet, as Claude+screenshots burns $$$).
The "12-18 months behind frontier" sounds about right, it's about where I was with gpt-4o and basic harnesses back then. In another 12-18 months my bet is we have Opus-level models that can be run locally for <$5k... but the frontier models will be even further forward (unless governments have blocked them). Fun times.
wsintra2022 8 hours ago [-]
Reading through these comments, I can't tell any more whats bots posting on behalf of the AI providers trying to dissuade or whether people just have had negative experiences with local ai models.
IMO, Qwen 3.6 27B 8k quants running on a Mac Studio 64g ram, incredible?. No it is not frontier general super shit, its just good. That's it, its good. Its free and private and can take an experienced engineer from being lazy to being really lazy, and that's magic right there. I use llama.cpp and opencode and have great moments of planning some code changes, and letting it run. Walk away. Chill in the hamoc, clean the dishes, have a wank, whatever. Use tmux and ssh in and check in on it. THIS is where the incredible comes in. Anyone telling you otherwise, well check their motives. I have no skin in the game. I just have an easy lazy time.
epolanski 7 hours ago [-]
The software "engineering" field is filled with MIT Leetcode ninjas writing React+Tailwind memory leaking unusable slop, the bar is extremely low.
jodoherty 12 hours ago [-]
I use pi with an RTX Pro 6000 Blackwell to run Gemma 4 31b to do all my agentic coding.
I find it useful.
This side project highlights a similar approach to how I scope and tackle projects at work now:
You have to apply a lot of careful architecture and TDD to your approach. Eliminate technical risk by tackling hard things early and wrapping them up in a simple, easy to use interface.
I find I can get some projects done 2-3 times faster than if I wrote them by hand. It can also save about 5-10x time on mundane or broadly scoped projects by helping me consolidate and try out ideas very quickly.
Setup-wise, I switch between vLLM using nvidia/Gemma-4-31B-IT-NVFP4 and llama.cpp using unsloth/gemma-4-31B-it-qat-GGUF with MTP. I throttle the GPU power usage to 400W.
My current llama.cpp setup gets token generation rates between 60-150 t/s depending on MTP draft acceptance rates. Prefill is between 1500-4000 t/s depending on context length/depth.
jborak 12 hours ago [-]
I'm using 4x RTX 5070's and first-gen AMD threadripper (1950X) to run Qwen3.6 27B (MTP) Q6_K with llama.cpp and it works great as a daily driver with Pi. Around 50-60 toks/sec. I also connect a few other applications to it such as OpenWeb UI and recently set up Bifrost, an LLM gateway, to be the primary access point for the models I serve.
I've tried other models such as Qwen3.6 35B A3B and I've found that 27B works better for me when it comes to coding. It's slower being a dense model but the quality seems much better. Inference on my system for Qwen3.6 35B A3B is around 130-140 toks/sec, non-MTP, which is insanely fast!
You don't need 4x 5070's to run Qwen3.6 27B, three or maybe even two will work. However, I use MTP (multi-token prediction) to speed up 27B and that eats up more memory because the draft model requires its own context.
Another thing to keep in mind is that the tools you're using have their system prompts that are loaded into the model for each conversation. When I fire up Pi, working with the model is very snappy at start. When I interact with the LLM via Hermes CLI, it's much slower. That's because each prompt with Hermes is loading so much stuff (skills, tools, etc.) into the context and then it's there forever until the conversation ends.
I like running models at home for privacy, but I also like how there are no quotas, usage isn't a worry. If the future is "loop engineering" then you will be burning through tokens and $$$ using a cloud models.
My system idles around 200W and is around 350-450W when inference load is high. Decoding (token generation) isn't all that efficient, and your GPUs sit idle more than you think during inference. Advancements like diffusion may 1) speed up decoding and 2) let you utilize more of your idle GPU.
zakisaad 9 hours ago [-]
This is interesting to me - why'd you go with the 5070 for your 4x build?
At first thought, they are quite skewed toward compute (vs VRAM), which is great for gamers but not so great for running LLMs.
(I run a 5070 in my desktop)
HappySweeney 15 hours ago [-]
I have an optane and lots of ram, so I tried full-fat models for writing some function overnight, as I get about 0.7 t/s. My current go-to test is to update a scalar function to transpose a bit-matrix to one using avx512. the cloud models all play with that like its nothing. Kimi 2.6 and GLM 5.1 both failed miserably.
cuttysnark 13 hours ago [-]
I've had some success with local models by chaining "agents" together in a workflow. Each agent has a different prompt and uses a different ollama model based on what their role is. The project manager, schema agent(qwen3:14b), etc. doesn't use the same model as the coding agent (qwen2.5-coder:7b). Between each step is an orchestrator and with a Playwright task which attempts to surface errors to the agent who introduced the previous code block. Only error-free blocks are forwarded to the next workflow step.
Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent they were to only produce only a manifest based on the task, and to pass off that off to the next agent.
In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
pianopatrick 12 hours ago [-]
I wish someone would do a benchmark and competition for this kind of work flow so we could figure out what works well.
Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
Like "The Local AI challenge"
sowbug 11 hours ago [-]
Have you (or anyone else) tried letting agents compete? For example, give the same coding task to two models, or to the same model with a different seed, and have the reviewer choose the better result.
Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.
mgsram 9 hours ago [-]
I have been using local LLMs for about a year and I have settled now on Qwen3.6 27b dense model in GGUF on Mac Studio with 512G of RAM with open code as the harness and llmster(LM Studio). I have also used the Qwen 3.6 35B-A3B but the dense model's accuracy is next level with the tradeoff being tokens/sec. With the Qwen3.6 27b, I usually get anywhere from 25-40 tokens/second. Initially I used them for simple tools but for the past 3-4 months, I have been actually doing production grade coding in C/C++ (Automotive Software stack) and Python (Tools) with Qwen3.6 27b.
The tokens/sec may be less but that kind of helps me in going at the right pace. The workflow I use for green field development / rewrites is to pair with Sonnet for design/architecture, reasoning and a detailed execution plan. I then feed this piece by piece with precise prompting and that does the job. For brown field, it is often a judgement call. There are occasions when I have found Local models to be limited in their reach and I resort to Claude Code
Some of my recent work using Qwen 3.6:
1. Complete rewrite of Power management Service in C using the existing C++ code as reference
2. Tool to parse contents from really complex specifications in Excel format
3. Tool to translate CJK contents to english for feeding into KG
I have tried plenty of other models with full FP32 as wel. However, in terms of balance between accuracy and speed, I found the Qwen 3.6 27B to be the sweet spot.
zftnb666 4 hours ago [-]
I replaced Claude with DeepSeek V4 Flash via API. Not local, but 95% the quality at 5% the price. Close enough.
GodelNumbering 12 hours ago [-]
As someone that spends all day every day talking to LLMs, I'd say the OSS frontier models + a good harness is already a sufficient combo. For local deployments, we are missing one or two hardware generations (and may not get that soon since hardware companies are heavily favoring datacenter segment) to fully move to a local setup.
blurbleblurble 14 hours ago [-]
My experience is that it's not the models themselves that are limiting right now, it's the clunky alternative harnesses with weird missing features making for bad ergonomics around stuff like queue management, interruption, subagents, goals, etc.
coder543 12 hours ago [-]
I agree completely.
It's also annoying that OpenCode doesn't even try to support local LLMs properly.
Getting OpenCode to work is possible, but extremely manual and clunky to configure. I have written a script to automate converting my llama-server configs into an OpenCode config, and that helps, but it's not ideal.
I have seriously considered writing Yet Another Coding Harness in my free time. I have some ideas for what would make it nice.
zackify 6 hours ago [-]
You have to try pi.dev you can already make it do anything you want. I use opus to customize and tweak parts of it. Its the best harness due to the entire thing being api driven for customization
wsintra2022 8 hours ago [-]
Not my experience at all. Mac Studio 64g, running Qwen2.7b 8K. Took ten minutes to get up and running, just read some documentation, Unsloth literally walks you through it. For Opencode just edit one file and its good to go. Have not had any issues (besides the occasional LLM related one). Not extremely manual and clunky at all.
horsawlarway 13 hours ago [-]
Pi is decent.
I've used the cli agents for claude, cursor, and pi, plus several custom harnesses I've written myself from time to time as experiments (and I guess technically gastown, if we're calling that a harness).
Pi is... just fine.
It does what I need it to, has a decent selection of tooling out of the box, integrates nicely with other tools, and generally gets out of my way enough that I don't think about it much anymore.
If you can run ~30b models at decent speeds, I think most folks would be pleasantly surprised at how capable they are with pi.
Which is something that all the other providers charge you api access rates for (ex - thousands a month).
Insanity 14 hours ago [-]
Heard good things about pi.dev but haven’t tried it. It might take care of some of those missing features you mentioned.
bityard 13 hours ago [-]
pi.dev is more like an agent developer kit. It's basically a substrate upon which you spend hours/days/weeks building your own agents or coding framework. It's pretty much the neovim to claude's vscode.
horsawlarway 13 hours ago [-]
I mean - the base experience is just fine, with perfectly reasonable built in tools for file access and editing, plus bash.
But yes - it expands a lot if you're willing to play with it.
I'd actually say the vscode comparison is wrong, because vscode is very much "bring your own extension" in the same way that Pi is. While Claude is much more "visual studio" vibes. It's thick, it's opinionated, and it's absolutely not something you can really customize, but it can feel slick for supported workflows.
cheekygeeky 13 hours ago [-]
Our software dev (smartest guy I ever met) is using OpenCode and Tmux with Open Source models. He says the DeepSeek is his model of choice for coding (he call's it "pretty GOOD". He's running two 3090s on an i9 with 128GB RAM. https://www.msn.com/en-us/news/technology/china-s-open-deeps...
pianopatrick 12 hours ago [-]
I wish someone would do a benchmark and competition for this kind of work flow so we could figure out what works well.
Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
Like "The Local AI challenge"
bravetraveler 12 hours ago [-]
I'm largely 'all natural', any of my little LLM usage is local. 128G Strix system, a not-super-dense Qwen or Gemma variant will get 50-80 tok/s output. Not subscribing to Anthropic/OpenAI/etc even in the unlikely event these are the last local models released; simply not needed. Entirely fine without and in-model tool usage covers my currency concerns.
acc_297 14 hours ago [-]
I've been wondering lately if it would help to take a medium sized model and either in cloud or some local setup actually do Reinforcement Learning from Human Feedback (RLHF) on every prompt as a chore - I don't know if trying to manually finetune a model to your use habits would ruin it or help - ideally if you were diligent you could get rid of some of the ticks that make models for the general public difficult to work with e.g. overly sycophantic, overly verbose, annoying tendency to explain via analogies
but perhaps one individuals prompt feedback just isn't going to ever be enough I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc.. and apparently these end up with bizarre behaviours not necessarily more helpful than the standard models)
I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendancies.
htrp 12 hours ago [-]
Cursor is doing that (i think with Fireworks as their provider)
I'm interested in trying something similar. I was thinking to do this for my OpenClaw agent.
About Owain Evans work: I think he did SFT. On Twitter someone was saying that RL is not as susceptible to what he showed. I'd like to try that
grmnygrmny2 12 hours ago [-]
Just sharing my $0.02 here - I have ethical objections to using OpenAI or Anthropic products so I was a reluctant adopter of LLMs at all. Local models address most, though not all, my moral objections so I’ve been using them for work and personal projects for about a month.
The hardware I have (32gb Macs and a gaming PC with 10gb 3080) can only get me to Qwen3.6-35B-A3B at various quants but that’s enough (200-400 PP, 20-30 TG).
It’s taken some time to learn how to best utilize it - some things take a bit of babysitting or direction - but it’s quite useful. Not having ever used CC I can’t compare but it’s been a great assistant or pair programmer for everything from embedded C++ to Vue. I wish I could run 27B as there have been moments when this model feels like it just can’t quite figure something out but those moments are quite rare. For a lot of tasks it’s a huge time saver and has proved super capable at digging into and fixing bugs given pretty vague instructions.
I’m using Pi as my harness.
patates 2 hours ago [-]
I have a mac with loads of ram but I cannot even justify the electricity cost when deepseek is so better than anything I can run locally (including heavy quantizations of deepseek itself) and costs pennies. It's crazy how cheap it is!
CuriousRose 6 hours ago [-]
An equally important issue with local AI use (not coding specific) is ensuring that the harness has fast and up to date data if recency is important in your querires (new package features, docs, etc). Hosted models do web search incredibly well and I think this is a huge part of output quality.
I don't use local hosted models anymore due to hardware contstraints, but I do have some degree of search anonymisation attached to my OpenCode and OpenRouter connected open models.
On my Macbook I run OrbStack that has the following docker containers set to route through a Mullvad based gluetun.
- Firecrawl - fast web scraping
- SearxNG - metasearch
- CloakBrowser - tursile bypassing Playwright alternative
If you wanted to get fancy with the proxy rotation, you could setup numerous instances of Playwright each with their own Mullvad wireguard key in different locations.
Results depend on the model, of course, and your computer is the limit. Mine wasn't up to the task, unfortunately.
K0balt 14 hours ago [-]
Pretty good results with qwen 3.6 27b dense. I’d say it’s about equal to (Claude) haiku 4.5 maybe sonnet depending on the task.
kadoban 14 hours ago [-]
What tool do you use to drive things for you, out of curiosity?
K0balt 4 hours ago [-]
I use Claude code. You can use it with any model you want to.
kandros 14 hours ago [-]
I’d rather ask my butcher than Haiku for coding tasks
K0balt 3 hours ago [-]
I’d say when qwen works it works like sonnet, when it fails it fails like haiku. So it’s less consistent but works pretty well, I guess? It’s still overall pretty useful for a lot of stuff, and I can run it directly on my MacBook. Once you get an idea of what it can and can’t bite off, it’s pretty easy to break things into chunks it will handle reliably with grace. But I still like to have access to SOTA models for review. Also you can have a SOTA model write a development plan that is basically a bunch of prompts to generate each part, then have the local model follow the plan.
I should mention not to run it at less than q6, I prefer q8.
papichulo4 13 hours ago [-]
Agreed on this. Anthropic has now changed the verbiage on the definitions of the models under `/model` to say that Opus is for everyday usage, and Sonnet is for routine tasks.
There's apparently a reason Sonnet and Haiku have been left in previous version #s.
Still encouraging, though, that things are catching up. We can't expect $20k local setups to match $20bn compute clusters.
12 hours ago [-]
henrixd 6 hours ago [-]
I have been heavily relying on Qwen3.6-27B-UD-Q4_K_XL.gguf -model and Pi agent (https://pi.dev/) for local tasks and coding. I have used llama-cpp-turboquant fork with some custom cherrypicked MTP patches from another fork.
I'm running this on V100 32GB (~900GB/s memory bandwidth) with 200,000 context window, --spec-type mpt --spec-draft-n-max 3 --spec-draft-n-min 0 --cache-type-k turbo3 --cache-type-v turbo3 to mention most relevant parts.
I usually get somewhere 45-60 t/s. I believe that speed could be improved slightly by switching to ik_llama.cpp fork and Qwen3.6-27B-IQ4_NL.gguf -model but there's no turboquant support and it's with some other tradeoffs too.
_bobm 10 hours ago [-]
But, guys, when you say Claude/ GPT models, do you stop to think what are these "models"?
One day I thought about how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself. Just think about it.
As a matter of fact, think about these operations, api endpoints, observe their output.
These so called SOTA models are not what meets the eye, and are not at all comparable in the infra department to local models. There is crazy orchestration going on due to the scale of these operations. But also these hard constraints lead to innovation. Innovation nobody speaks about.
I wouldn't say we cannot catchup, but serving our local models through llama, vllm is just the A, B, C of it all. In reality I think what is needed is a replication of said orchestration which I hinted at above.
The SOTA models are a deep orchestration of multiple models operating together it isn't a single model. As such no single model ever will catchup to them until it replicates through training first and then maybe through model architecture this orchestration.
Finally, I would wager that the SOTA "models", as one of these models in this orchestration setup, as served for general consumption, are not so much more capable than qwen 3.6.
I am sure that if you change your perspective you will start noticing the scale of the "magic".
XCSme 9 hours ago [-]
> The SOTA models are a deep orchestration of multiple models operating together it isn't a single mode
I don't understand, why does it make you think this is the case?
> how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself
Can you give an example?
_bobm 9 hours ago [-]
> Can you give an example?
Sure, connect opencode to an openai/chatgpt endpoint and use it. You will notice multiple "thinking" parts per "turn".
I put all of these in quotation because... they are part of the orchestration game. For example, it is not known if the thinking parts of a particular turn are chain of thought thinking summaries or just plain response which is masquaraded and thus orchestrated into appearing as thinking.
Further notice the cadence, word choice and sentence formation. Notice sentence construction. Notice "thinking part" construction and sequencing.
There is pretty heavy orchestration.
> I don't understand, why does it make you think this is the case?
Because not all tokens are equal. And if you waste expensive tokens on mundane tasks you will go out of business. This is the reason.
As I said, if you observe the output from these api endpoints you will notice it.
XCSme 8 hours ago [-]
> You will notice multiple "thinking" parts per "turn"
I thought that was the code harness simply minifying the outputs.
Many models now no longer return the entire chain-of-thought (to avoid distillation attacks). So yes, we don't get the raw LLM output, but I think it's just the thinking summarized, not a complex orchestration or different models.
I do agree though that now cloud models are kind of a black box, that's not only obfuscated but also changes over time. Companies seem to be changing model capabilities without notifying users, or even hiddenly serving completely different models. This is even worse via OpenRouter, with providers serving open-source models, some of them serve heavily quantized versions or even completely different models.
_bobm 7 hours ago [-]
idk what is "minifying outputs" in the context of what we are talking about. Opencode is opensource, you can find out what it is doing.
Last time I checked, OpenAI even send (in the response) the summary of the thinking part alreafy in markdown, so opencode has to remove the formatting to format it to their liking.
> Many models now no longer return the entire chain-of-thought (to avoid distillation attacks).
This is what they say: to avoid distillation attacks. And to some large extent this is true. I am saying there is a side- effect and this side- effect (depending on how tin-foilly you want to go) may be either a nice thing to have or it may be the "main reason" for all of this.
The side effect is splicing the inference, brokering requests, and what not, which brings huge benefits at scale.
This was my original point: openweights model to a sota model may be apples to oranges. So when will a local model catchup with its single cot run which is not even shaped properly: well never.
It is apples to oranges.
XCSme 7 hours ago [-]
So, are you saying that local models are maybe better than we give them credit? Because with some extra orchestration/processing we could improve the results?
_bobm 6 hours ago [-]
Yes, local models have already all that is needed, they have all the prerequisites.
But what they do not have is the correct shape, the correct approach. This is missing and it shows on multiple scales: it shows in the COT, it shows in the output itself, it shows in the infra to serve the models, it shows in the model orchestration.
This is what anthropic said one year ago:
> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.
JSR_FDED 2 hours ago [-]
This all sounds very mysterious
_bobm 2 hours ago [-]
Yes, but it isn't.
3abiton 7 hours ago [-]
I think nearly everyone mentioned Qwen, so my turn I guess. Qwen 3.6 35B Q8 (MTP), on a Strix Halo, with llama.cpp. Around 40-50 t/s. Really great pefromance, I get always suprised by its capability. I used with forge-code directly in zsh. For long context 150k+) it start degrading and forgetting.
mitchell_h 14 hours ago [-]
Tried. The context windows just weren't big enough.
coder543 12 hours ago [-]
Qwen3.6-27B supports a 1 million token context window.
Of course, you have to have the right hardware to be able to run with a context window like that, as it takes about 100GB of memory on my DGX Spark to do that with full f16 KV cache on the q4_k_xl model.
lysace 14 hours ago [-]
Got a similar result (my RTX 4070 only has 12 GB). I'm curious about whether 24/32 GB meaningfully improves this enough to make it useful.
tobyhinloopen 13 hours ago [-]
Try it on RAM and CPU.
It’s slower but you can run them.
lysace 13 hours ago [-]
Good idea for evaluating the models, thanks.
deadbabe 14 hours ago [-]
Prompt more directly instead of open ended.
moezd 12 hours ago [-]
Not yet. Without pure Apple game or decent GPUs, even with a lot of RAM and threads, all you get is about 30-50 tokens/second, and that's thinking turned off. Without these optimizations your model will have a field day with your MCPs, skills and agent descriptions and you will watch the paint dry before seeing the first output token. Local model serving means you have to fight for every token in your context window, which is quite opposite of what Claude/GPT/Copilot are pushing the industry towards.
amarshall 9 hours ago [-]
Thinking doesn’t change output speed. Anthropic’s models are ~ 40–60 t/s median output speed.
jrflo 3 hours ago [-]
I would love to do this if it didn't require such a huge amount of RAM. And the difference in quality is worth it to pay $20-$100/mo if data retention doesn't matter to you.
heisenbit 10 hours ago [-]
I think it is work to set up but I'm also learning a lot setting it up. Mainly using qwen/qwen3.6-35b-a3b mlx with my 48GB M4 MBP which leaves me just enough headroom for docker dev-container and other basics. I use LM Studio to run and am using it via VSCode. A big difference made the system prompt improving the tool integration (I asked GPT for guidance on that). Before that it was not making changes but regenerating code often messing up than helping.
I mostly run my MBP on low power even when it is plugged in to avoid the noise and heat. Full power maybe doubles speed but more than doubles power.
What can it do: Simple restructuring of pages. Where did it and other models fail: Splitting up Pinia store which GPT-5.4 did without fail. I think with more tuning, guidance for tool use and maybe some support tooling around it performance can increase further.
ozten 4 hours ago [-]
Yes, for client projects where privacy and security is important, but no enterprise contract:
Open code against Infomaniak hosted OSS models: Qwen3.5-122B-A10B-FP8, Kimi-K2.6.
I use API keys for billing. It performs like Dec 2025 in terms of my productivity back then.
bijowo1676 12 hours ago [-]
One of the interesting setups I saw is using expensive frontier models to write and update markdown for your app: specs, product requirements, architecture, etc
but then use cheap/local model to implement the specs.
Markdown is more effective at compressing information and fits the context window easier, than hundreds of source code files
but this requires second and third passes, to smooth out the rough edges
has anyone tried that?
pdyc 3 hours ago [-]
yes
harness - pi+custom extension for subagents
model - qwen3.6 35ba3b q4km
hardware - intel arrow lake with 32gb ram
server - llama.cpp vulkan
performance - 15-18t/s generation 50-150t/s pp
planning and task creation is still using claude/gpt but they dont touch the code. All coding is done using this setup.
Example of project made using this setup easyanalytica.com , its of medium size complexity
12 hours ago [-]
milchek 9 hours ago [-]
I’ve tried in a 36GB MacBook Pro and haven’t had much success beyond very basic work. Issue for me was the context runs out quick even with smaller models and it’s slower. To get some half decent performance I’d imagine you want 128gb memory and are spending a lot more on hardware. At that point it becomes a question on whether you’d rather have frontier models at a subscription or sink that money into your own equipment. Of course, for those with privacy in mind your only option is forking out the cash for the higher end machines.
carlossouza 4 hours ago [-]
This should be a recurrent question posted every month
SupLockDef 12 hours ago [-]
Local isn't new for me. I am still coding my stuff, but Qwen3-coder:30b on my old rig with a gtx 1070 16gb RAM does wonders for me.
I mostly use it as a google search if I forget a thing, or doing the boilerplates.
I am using a mix of a non harness chat for the reply speed, and opencode / vim-ai for my boilerplates.
$0.00 / month. That's the budget.
jboss10 10 hours ago [-]
Have you tried qwen3.6 or pi?
SupLockDef 9 hours ago [-]
3.6 is too slow on my old rig for some reasons, so I went back to qwen3-coder.
I did try 3.6 on my main desktop. It was good, but I didn't see much differences than coder, so I am still using my old rig.
I have not. We use openspec with our projects at work. To try and simulate a local rig without spending big cash. I use the hosted models and pay for them with the latest popular local model.
Most small local models don't get tool calling right, however the larger models are now doing this correctly now.
One thing local has not accounted for, is most productive engineers are running multiple cli chats at a time with git worktrees. I normally hover around 3 worktrees + cli-chats.
Running AMD Lemonade as the daily rig, Started with Ollama then over to LMStudio and now standardized on AMD Lemonade which has been helpful to monitor cRAM, CPU, GPU and gRam. The multi-models on Lemonade make it straight forward to run a stack for LLM, Voice to Text, NPU, and Image Generation. Platform also works with Nvidia, Apple, Intel and AMD chip sets.
BiraIgnacio 13 hours ago [-]
I tried for a bit, with llama.cpp + Qwen + Mac Pro but the results were very poor (both quality and speed).
I considered investing in better hardware but doing the math, it is cheaper for me to pay for DeepSeek (yeah, I know not everyone can do that).
zaptheimpaler 13 hours ago [-]
I tried gemma-4-26B-A4B just to see if it could help me read/sort my emails on a relatively under-powered setup (16GB VRAM + 32GB RAM) and it's not going well.. the model burns 24K tokens just on searching for the right tool and then dumps the email contents into context - i tried to get it to use code-mode to save context but the code-mode implementation can't save files so it was useless and im going to try to switch to "ssh-mode" into my devbox container. Still relatively new to this, so I'm probably doing something wrong
Rzor 3 hours ago [-]
So there was a problem with gemma 4 when it comes to tool calling that Google apparently fixed like 2 or 3 days ago. I remember reading something about this.
anana_ 12 hours ago [-]
Perhaps try a different model? Just from anecdotal experience, I find that the Gemma models smaller than 31B do not tool call as often as they should.
Some of the benchmarks appear to back this up [0]
Of course, a lot depends how you are using it (inference parameters, harness, prompting, etc.), but the model is quite important too.
I'm looking forward to having Claude Fable at home. THAT is when I'll THINK about replacing Claude (who knows what their next models will be capable of, Fable was damn good for the three days I had it).
trueno 13 hours ago [-]
we keep moving the goalposts on when we're gonna be happy with local. first it was sonnet at home as the good enough, then opus, now it's the mysterious leading model that runs on infrastructure we can't feasibly have at home
boringg 13 hours ago [-]
Will the AI labs always make sure there is at least a years worth of differential? I guess the underlying business premise is that each new release has a step function change that prevents this kind of behaviour..
snoman 9 hours ago [-]
If the government is going to gate access to frontier models from here on out, even if new releases are a step function change… which they’re not… then it may be even more comparable to what’s available with a subscription.
v3ss0n 6 hours ago [-]
Yes qwen 3.5 122b+ dgx is working wonders and I ko longer subscribed to any cloud api now.
I will post a project which I accomplished in 9 days of long horizons running.
sj_tech 8 hours ago [-]
I use Qwen 3.6 35B A3B for agentic coding using GitHub Copilot Extension for VSCode. Mac Mini 128GB as the hardware. Seems reasonable for that model size, but I notice looping issue when problem becomes too big to solve. You can use it to do something that you know how to do (saves time).
dabinat 14 hours ago [-]
There’s evidence that combining models can achieve frontier-level performance (e.g. OpenRouter Fusion). I’m wondering if that’s the more realistic option: combine Opus with a local model to save on token costs.
rvnx 12 hours ago [-]
I start to believe that adding more and more and more and more and more thinking tokens is the hack that works (this is what gave birth to Fable)
utopiah 36 minutes ago [-]
Why would you not think that?
It seems pretty intuitive that pouring more resources into a problem (more GPU, bigger GPUs with more VRAM, bigger datasets, better curated datasets, more efficient ways to train, more efficient way to run inference, etc) then running the result for a longer time, with more layers of verification (running in VMs, model fusion comparing multiple models, having harnesses with testing) will at least lead to marginally better results.
Is it worth it and at what pace will it keep on improving are different questions but I have little doubt that if the industry keep on pouring resources, sure more "works".
ndom91 12 hours ago [-]
Not 100%, I still fall back to Claude for most day-job stuff. But I've been trying to use Qwen 3.6 and Gemma 4 on my framework desktop mainboard (Strix Halo) as much as possible.
I've been working on an ops style tool for local LLM inference. Proxying, api keys, request logging, model rewriting and much much more.
Not yet, tried Gemma 4 on an Apple M4 but the tok/s is significant lower than the cloud offering.
Also,the lack of enterprise tooling to help selected an appropriate model and tooling to run a local LLM does not help.
bArray 12 hours ago [-]
I'm in the middle of building my own based on LiquidAI/LFM2.5-1.2B-Instruct [1]. I run it on the CPU locally and get reasonable performance. I'm currently using it to solve small problems - but expanding it daily.
I'm using Qwen 3.6 on my MacBook Pro M5 Pro with 48BG RAM for any work that I am particularly privacy conscious about, like any work with my journaling. It's been working great! I don't have any direct comparisons, but I've been satisfied with the results.
russelg 6 hours ago [-]
I've got the same spec, are you running the 27B or the 35B-A3B? I found the 27B was unusably slow (like 10-15t/s not to mention the prefill times)
Always a bit disappointed in the details in these kinds of threads. When you do get answers, they're never specific enough to try out on your own. It'll be something like "I use Qwen 3.5 and get great results!" OK but what quantization are you using? What llama parameters? What context size? What GPU are you running it on, and how much VRAM does it have? Are you hosting it on a separate box, or running it locally on your dev machine? What coding agent tool are you using, and how is it configured / hooked up to the model?
riazrizvi 13 hours ago [-]
All you get here is some market signal from 1 or 2 posts if you already know how to do it. Most of these responses are garbage.
porkloin 13 hours ago [-]
I have good results with this setup:
Hardware:
- GPU: AMD 7900xtx, 24gb vram
- CPU: AMD 5950x, AM4
- RAM: 64gb DDR4 3600
Software:
- OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)
- Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units
- Network: tailscale
- Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)
- LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.
- Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.
Models:
- Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.
- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?
- gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job
Flags (specific for Qwen 27b, since that's primary model):
- `-ngl 99` offload all layers to GPU
- `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing
- `-np 1` single slot (no parallel request handling)
- `--no-context-shift` error instead of silently sliding the context window when full
- `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)
- `-b 2048` logical batch size (tokens per submission)
- `-ub 1024` physical micro-batch (per GPU pass)
- `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling
- `-fa on` flash attention
- `--spec-type draft-mtp` use the model's built-in MTP as the draft model
- `--spec-draft-n-max 3` propose up to 3 draft tokens per step
- `--spec-draft-n-min 0` allow zero drafts if confidence is low
- `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path
- `--reasoning-format deepseek` parse <think> blocks in proper format
- `--chat-template-kwargs '{"enable_thinking": true}'` turns on Qwen's thinking mode on by default (clients can override)
- `--jinja` use the GGUF's Jinja chat template
- `--temp 0.6` moderate randomness (Qwen recommended value for coding)
- `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)
- `--top-k 20` top-20 candidates (Qwen recommended value for coding)
- `--min-p 0.0 disabled (Qwen recommended value for coding)
Performance (27b, primary model):
- ~65t/s for token generation
- ~600 t/s for prompt processing.
- If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.
- ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.
I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.
CLI/Harness:
- Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)
- Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.
A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that its the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.
This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.
Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(
ryandrake 12 hours ago [-]
Now that's what I'm talking about! Very cool, thank you for the detailed response.
drnick1 8 hours ago [-]
- What would you say is the best model for coding at the moment that can run on a high end consumer GPU? (Assume an RTX 3090/4090 is available.)
- What "stack" do you recommend? Llama.cpp + OpenCode?
xhinker2 12 hours ago [-]
Yes, I have.
1. Two RTX 3090s in Linux 22.04
2. Running Qwen3.6-27B Q6_K_XL GGUF
3. Using my own harness AZPal, I build myself, also wire it with Hermes Agent, works fine
4. Many times it solve problem that Codex can't solve
I've been using MiniMax M2.7 with vllm on my dual Nvidia Spark cluster. Slow (<20 tps) but functional for most of my use cases.
cmrdporcupine 5 hours ago [-]
I was just looking and it should be possible to run this one on 3bit quant on my single Spark? Maybe? Depending on context size? Assuming 3-bit doesn't totally lobotomize it.
whartung 12 hours ago [-]
Will the inevitable M5 releases from Apple change this equation in any meaningful way?
I'm waiting to swap out my last gen Intel iMac with a new M5 mini of some kind, with the eye to hopefully be able to run some models locally. I envision a mini (heh) arms race to simply swapping out an M(X-1) for an M(X) annually as this field shakes out.
627467 11 hours ago [-]
So, everyone has different context, but how free is free running these local models? Like having a power hungry machine always on in the cupboard?
How much does this ware out the hardware?
Also, if privacy is the main reason for running local models, why not use venice.ai and equivalent?
Lwerewolf 14 hours ago [-]
mbp16 m5 max 128gb, antirez/ds4, deepseekv4-flash. Works well for relatively dense (let's say <20k LoC per project) C codebases that are essentially a bunch of custom specialized stores, http servers, network infra, media transformers, etc.
Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.
Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.
Performance as described in the tables in the readme here:
https://github.com/antirez/ds4
...with a bit less than half that at "low power" (30w). Both are usable.
codelion 5 hours ago [-]
Using qwen3.6 27b locally with Claude code, it works well for simple coding tasks
qu0b 10 hours ago [-]
I'm using deepseek V4 on two rtx 6000 pros and its working great. Opus is so slow that I get deepseek to do most of the work and Opus is only used to validate and help plan.
julianlam 8 hours ago [-]
Of course.
Qwen 3.6 35B-A3B on a Framework 13 with 32GB of memory.
Running llama.cpp, 15 tokens per second. Outputs code and text faster than I can parse.
overgard 11 hours ago [-]
I haven't yet, but I just bought a 128GB M5 Max 40 core which I'm hoping can do it (if not, it's a good laptop regardless, I actually need that amount of RAM for non-LLM stuff)
kristianpaul 11 hours ago [-]
Qwen3.6 35B on gigabyte aitop (spark clone) but be very specif what you ask and how should be solved
Nemotron super 3 110B works well for 1M context long vibecoding sessions
I also use Pi harness with no extension
jmward01 12 hours ago [-]
Has anyone been storing their cc sessions for future training data on their own models? I'd love to build a system that fine-tunes on cc sessions and a good first step is capturing my own sessions well.
Didn't realize they did this. I have avoided pushing data to huggingface. This is all -deeply- private info and I haven't really reviewed their privacy policies and the like. I'll give them a look.
shironnnn_ 12 hours ago [-]
I use SpecKit to create a very detailed plan with a high amount of specificity using paid Claude plan.
Then I give it to local LLM (eg: Qwen / Gemma 4) via CLI. This is possible through usage of llm-mlx on Mac (or ollama on any machine given sufficient on hardware) which serve OpenAPI endpoints compatible for Aider (CLI) or Visual Studio Code to vibe along with the agentic coding assistant.
The paid products have an advantage but are not necessary if you don't mind to be more-involved with the process and have low expectations.
ecshafer 14 hours ago [-]
I work with a few models on servers, so not local, but self hosted with ollama. gemma-4, glm 4.7 flash, and qwen 3.6. glm is the best at coding agentically. But I still don't think any of them reach the levels of gpt 5.5 or opus 4.8.
mark_l_watson 12 hours ago [-]
I would like to say I run 100% local, but I use Opus + Gemini Pro cumulatively for 3 or 4 hours a week. I also like to use DeepSeek v4 flash with OpenCode for small quick tasks.
I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.
For me, the problem with all local LLM-basic coding agents is slow runtime.
I would like to know whether someone was able to use lower tier models for activities other than coding e.g. a limited version of a personal note manager - and what the hardware requirements in RAM for these models were.
fortyseven 14 hours ago [-]
I use Pi and Qwen 3.6 27b locally on a 4090 for all my personal projects. I still use Claude for day job work since they pay for it, and my employer expects me to use it. I rarely touch it otherwise.
anuramat 11 hours ago [-]
I wonder what languages people are using; I imagine smaller models would be decent at bash/python but significantly worse at something like rust
hegdeezy 13 hours ago [-]
I have tried locally but I find that the implicit breakeven is somewhere around 1 year of use given the high power costs where I live. Not really worth it but maybe if I move some day!
redox99 13 hours ago [-]
Models that you can run at home (Like Qwen 35B) aren't remotely close to Opus or GPT 5.5. Not even close. The only open models that are in that neighbor are around 1T params, so forget about running at home.
It's kind of like driving a shitbox. It can often drive you from A to B, and some people will try to convince you it's fine. It's not.
There's no logical reason other than absolutely requiring the privacy, doing it for fun, or niche use cases like airplanes and so on. If you can't spend the insanely subsidized $20 for codex, you can use an API for chinese models which will run circles around these tiny models.
pbasista 12 hours ago [-]
> Models that you can run at home (Like Qwen 35B) aren't remotely close to Opus or GPT 5.5.
Is that characterization based on some objective facts or benchmarks?
kube-system 12 hours ago [-]
Yes, there aren't any 35B models that are beating frontier models at just about anything generalized
redox99 12 hours ago [-]
Based on private test prompts I've run through OpenRouter.
xgulfie 11 hours ago [-]
I don't need a Ferrari to get to work
orangeisthe 10 hours ago [-]
But you need the best tools to do the job
cayley_graph 7 hours ago [-]
You need tools sufficient to do the job in an economical way, optimizing for both cost and quality. That is what 'best' means. We don't give every engineer all the resources under the sun, only what is appropriate.
I suspect many will realize millions more dollars are being spent than needed to achieve the highest marginal productivity gains, and reallocate accordingly. Who wants more of their money going to developer tooling, rather than bonuses?
orangeisthe 2 hours ago [-]
Of course. I have a $20/mo Codex subscription that has been serving me very well. Occasionally when I run out of quota, I switch to another one of my backup $20/mo subscription.
That's way more economical and produces far better result than any self hosted models today.
_davide_ 14 hours ago [-]
i used to mix remote and local minimax 2.7(q3) on my strix halo, it run at 30 tg and 220 tokens pp... it was a bit painful slow, but it was a good feeling i could stay offline. unfortunately m3 which is at opus .8 levels is 460b parameters and doesn't even fit in 128gb of memory, let alone a big context. strix halo feels like a toy for ai purposes. https://kyuz0.github.io/amd-strix-halo-toolboxes/
sosodev 14 hours ago [-]
My strix halo board is feeling more useful and less toylike with the recent performance gains combined from MTP, better quantization, and generalized performance improvements across the stack. For example, I can run Unsloth's Gemma4-31B 4-bit QAT model with around 30tg and 200pp. I don't find that to be too slow at all. Particularly because it's nearly full accuracy and good enough for a lot of different stuff I throw at it.
I think it also helps that I'm using my machine to do home server stuff. It excels at all of the traditional workloads. Then I can lean on the AI to help with automation here and there. I find it deeply satisfying.
_davide_ 12 hours ago [-]
you can absolutely use it for some workloads, but as soon as you have some extra complexity for a big repo it'll take forever and the economics are so silly to the point that the electricity bill would be comparable to a subscription. I love having the possibility of running things locally if some random dude decide to pull them plug, and give me solice the fact that i can have 100% private inference, but as the main driver during the day? shoot me
agentbc9000 9 hours ago [-]
Kimi K2.7 is very good - i have been testing it and its very very good, Fable 5 level of goodness.
bentt 9 hours ago [-]
Say more!
catapart 10 hours ago [-]
tough ask, but since we're here: has anyone done this with 16GB of VRAM? I've been getting projects finished with LM Studio, but it definitely could stand to be more efficient. lots of time wasted with trying to get models to understand a problem with so few tokens.
Rzor 3 hours ago [-]
RX 9060 XT 16GB here on google/gemma-4-26b-a4b-qat using LM Studio. Context 65k, 23 layers on the GPU, 7 on the CPU, model in memory, mmapped. I'm getting 23-33 tks. Started experimenting 3 days ago (with gemma-4-e4b), don't know what half those settings mean, but 26B, even quantified, feels significantly better at a few small projects I asked it to create ("create a image converter using ffmpeg in bash", "create a canvas animation with real physics, no libraries"[1]).
It's faster than I can read, but it feels slow as hell. I think 40-50 tks is probably much more comfortable and I hope I can reach that when trying this on llamacpp soon enough.
Ollama + Hermes on M5 Max 128GB using .NET using Qwen 3.6:35b-a3b as the primary model to do the work. I might use 27b to plan what to do.
xeonax 13 hours ago [-]
Whats .NET doing in between?
AH4oFVbPT4f8 11 hours ago [-]
Sorry, I meant to say I was writing .NET C# with the setup
SugarReflex 8 hours ago [-]
Is anyone using Aider?
Is there any decent CLI alternatives to it?
SkitterKherpi 14 hours ago [-]
It has so far been the kind of thing that always feels like the next version of the local models would be the one that is just good enough.
chungus 9 hours ago [-]
Yup, although technically not replaced because I never used either of those products because I don't like sending my code to their black box. I have 2x24GB AMD gpu's, gotten from gamers on my local marketplace, one is connected with a 40cm riser cable. Running Qwen 27B and am very happy with its performance. Q8 with 135k context (arbitrary number, I could push it to 256). I like to use qwen 35B3A for mapping out entire code paths through our relatively complicated codebase/infra at work.
I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem run through by miners and the likes, to build an even bigger rig for parallel execution.
Power usage is also totally not an issue, AI workload is very different from gaming.
tldr
llama.cpp-vulkan with opencode on total 48GB VRAM AMD cards on arch btw.
devmor 2 hours ago [-]
I’d be surprised if this was useful for much. Claude is already almost too slow to do anything serious I’d consider using it for outside of grunt work without parallelizing.
The only reason it’s economical is because it’s massively discounted if you’re not paying API rates.
euroderf 11 hours ago [-]
Is anyone managing to do this on a Mac with a measly 8GB ? Asking for a friend.
jwr 13 hours ago [-]
I tried many, many times and I keep trying. But I just don't see this happening: those tiny models that we can run on our machines (I have an M4 Max Mac, so I can reasonably run qwen3.6-35b-a3b or gemma-4-26b-a4b-qat at this time) are NOWHERE near as smart as the huge monsters like Opus/Fable. Nowhere. I can see a lot of people deluding themselves.
Sure, you can get the local models to generate plausibly-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.
jmichaelson 13 hours ago [-]
I am working on exactly this issue right now. My approach is that a highly optimized harness (pi.dev) with the right backing knowledgebase (a custom, self-updating wiki with lots of QC layers) can get close to most of my usage patterns for my Claude Max 20x subscription. I use Gemma 4 26B QAT served by a custom fork of llama.cpp, with 4-8 slots of 256k context at Q8. It's a very good model when the harness keeps it on rails. In an age of 1M context windows, 256k may seem small but it's been plenty for my work (scientific programming). A $20/month subscription to Ollama-cloud gets me good coverage of consults out to frontier models for difficult plans or debugging (again this is all woven into my highly customized pi install).
I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.
Related: Are there any viable distributed AI models?
Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.
Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.
Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.
joshuamoyers 14 hours ago [-]
I think it'd be very hard to achieve viable tokens/s or get arithmetic intensity to be high enough in general, since many cases in existing training and inference are memory bandwidth limited. Definitely seems possible to conceptually have a slow pipeline that is distributed though.
13 hours ago [-]
SimianSci 13 hours ago [-]
This is unlikely to happen in any meaningful fashion for quite some time.
(TLDR; Distributed compute for models will require hardware at a level only really possible with data-centers at the moment.)
Token generation operates at such a scale to demand enough from a single GPU as it will often saturate the bandwidth capabilities of consumer grade interconnects like PCIe. Which fundamentally implies that distributing a model's compute across vast distances is too much of a challenge without significant infrastructure.
To give an example, When we split a model's compute between two seperate cards on a single workstation, this doesnt mean we end up with 2x the compute bandwidth for a model. Instead the increase becomes something small like 20% depending on model, because the inconnects (PCIe on consumer hardware) will quickly become so saturated with data being copied between the two GPUs so as to become a bottleneck. And remember that this is something that happens locally with PCIe, which (depending on generation) will cap out at around 20-35 GB/s depending on the generation of motherboard.
Model performance is very much tied to having the fastest and highest bandwidth single card available so as to keep data transfer operations to a minimum as the sheer volume of data necessary for the model to run is immense.
I simply cant imagine how slow and unusable a model would be if the copy operations necessary for its compute needed to be performed over unreliable network speeds where there will be significant performance loss as network speeds are not reliably distributed across the globe, and their unreliable nature would demand increased overhead due to data verification.
The dream of distributed AI is a ways off.
salutonmundo 8 hours ago [-]
it's called your damn brain.
wmedrano 12 hours ago [-]
No, but I use GLM5.1 instead of Claude/GPT.
drnick1 11 hours ago [-]
Do you recommend Ollama or bare llama.cpp?
jboss10 10 hours ago [-]
llama.cpp It's faster and more open source. Ollama has some mixed history. I use llama-swap to emulate the Ollama experience.
shironnnn_ 11 hours ago [-]
if on MacOS I recommend llm-mlx which currently renders tokens 10%-15% faster than llama.cpp.
lasky 1 hours ago [-]
for crying out loud... why would you deprive yourself?
devin 13 hours ago [-]
Anyone here running a tinygrad?
lowbloodsugar 6 hours ago [-]
If you want to try it out before dropping $$$ on a GPU, just run something that would fit on your target GPU but online.
system2 14 hours ago [-]
Until I can buy an 80GB VRAM GPU, I won't attempt to do it. A local LLM is always missing something that needs a bigger model.
ColonelPhantom 8 hours ago [-]
Which model class requires an 80 GB VRAM GPU? From my perspective, popular models seem to be either in the ~30B range (Qwen3.6, Gemma 4), while the larger models (MiniMax, MiMo, StepFun, Deepseek) are in the multiple hundreds of billions parameters, for which 80 GB is simply too small.
You can just about reach the lower end of the latter category with a 128GB machine like a DGX Spark, Framework Desktop, or M5 Max, though those are usually not super fast. For the former category, you can easily run them fast with something like a 3090 or 5090, hell, probably even a 5060 Ti.
system2 45 minutes ago [-]
Video models.
CamperBob2 2 hours ago [-]
This is true. There's not much point in buying only one RTX 6000. You need at least two to run anything interesting that you couldn't run on a 5090. And you can imagine where it goes from there.
w10-1 12 hours ago [-]
I run many models (but mainly Gemma-4) using oMLX (for caching) on a 32GB M1 max using (gasp) Xcode. For tok/sec response times, I'd say it responds faster than I could read the prompt aloud in many cases (and I'm not constantly polling the Claude status page).
For months I spent time curating the AI+harness+skills+MCP servers, but now mainly just code with it. I find myself not bothering to use Claude (but keep paying "just in case").
That's feasible in part because my prompts have very specific objectives, constraints, and suggested staging, because I want the code to be exactly as I would write it, and I want to weigh in at specific moments. I would say the speed-up is 2-4X instead of the 10X of vibe-coding greenfield projects. The problem is not the coding speed, but building something complicated that's also correct and flexible (i.e., a directional accuracy). E.g., the agents help with abandoning a less-fruitful API shape instead of sticking with what works in a local maxima.
One flaw there is that I'm still writing code that feels clean to humans, which now is probably a waste. LLM's might be happier with 10+ parameters on one API instead of a plethora of configuration objects and convenience wrappers.
I do qwen3.6 on an amd ai max laptop getting about 6-10tok/s it’s slow enough that I can follow along.
It has issues with design and large piles of code.
Otherwise it’s a good programming buddy.
platevoltage 7 hours ago [-]
I run very small models locally for code completion and writing boiler plate. I still use Claude in a web browser on occasion since it's free, but the second that goes away, I'll be done with it. They get none of my money.
epolanski 8 hours ago [-]
Not with a local one, but I moved to DeepSeek v4.
Albeit I plan to move to local ones when I will get my hands on a 256+ GB macbook.
Local inference is good enough to help me with my daily job, and doesn't turn me into an assistant to the LLM.
sometimelurker 11 hours ago [-]
yeah I use one one the small MTP qwens and pi
major505 12 hours ago [-]
Yes. I use Owen on my MacBook m1 (16gb) daily, running inside Ollama. Works well. Is not particularly fast, and I need to create a custom imagem that sets the temperature of the model to zero starting, so I don't get over creative with its bullshit, but it works reasonable week.
Der_Einzige 11 hours ago [-]
Secretly the problems many people have with agentic coding are related to poor choice of sampling settings, but the world will wait several more years before this is understood well. top_p and top_k are garbage but they are intentionally kept on purpose because subsequent methods enable coherent high temperature sampling, which is an absolute no go for alignment/safety reasons.
The secret to actually good agentic outputs even with small models? Llamacpp has support for this little known sampler called "top-n sigma". You should use that, set it to 1 and set temperature to literally whatever you want (it could be infinity) and your model will just magically work to your maximum context window. That's because long context generation is a sampling problem.
jay_kyburz 8 hours ago [-]
Can anybody let me know how just chatting with Qwen3.6 on a Strix Halo 128GB
If I give it a page of context, can it write a linked list or identify a bad line of CSS?
Is there anywhere online I can chat with a model I could be running at home to see how good it is?
thrownaway561 11 hours ago [-]
I just use DeepSeekV4 Fast... It's cheap as hell. Currently my monthly usage has been
67M Ouput
51M Input
Total $0.83 dollar.
I honestly don't understand why people just don't use DeepSeek.
ThomasGlanzmann 10 hours ago [-]
I do the same. deepseekv4 fast for the 90% of the tasks, if it can't lift it, I use deepseekv4 pro. I use crush as coding agent but removed the blocked commands because I also do a lot of system administration. Love it. I use 8 USD in 7 weeks and use it quiet extensively for all sorts of things, programming, system administration, google search replacement, investments, you name it.
codemk8 8 hours ago [-]
You mean deepseek-v4-flash, right? Same here. I use it for my Hermes agent. It's so cheap that I sometimes feel "guilty". I even put more money than I needed just make sure they do not go out of business.
ThomasGlanzmann 2 hours ago [-]
Yes, I do mean deepseek-v4-flash.
jeffrallen 12 hours ago [-]
I use Qwen 3.6 on a remote GPU that my work offers. Works fine. Slow and steady, works hard, gets the job done. Probably better at diagnosing than making new code, but whatever.
gigatexal 13 hours ago [-]
I tried to. I just couldn't get over how it made my otherwise whisper quiet M3 Max MacBook Pro 14 for the performance. The sweet spot has been adopting Claude Code to use the Chinese models. Deepseek V4 Pro is very, very good. But I am such a casual local user of AI that my 20/month Claude subscription is enough and I find myself using that more and more.
syngrog66 7 hours ago [-]
pre-replaced it with combo of my brain, vim, an assortment of other CLI/TUI tools, etc
cyanydeez 12 hours ago [-]
never started. using wither qwne3-xoder-nezt or qwen3.6 35b
if youre shoopping for a new pc, very easy to justify 128gb vram
dude250711 14 hours ago [-]
Yes, running a local model on a natural wetware substrate here.
Recommended setup: plenty of nutrients, some caffeine and a quiet environment.
Performance - not currently measured in tokens: roughly average.
jasongill 14 hours ago [-]
I have been running this stack since well before Claude Code became popular. It works OK but I've found it to be very slow; and despite having a big context window, it seems to lose track of what it's working on and goes down a rabbit hole (or just wastes tokens trying to use the web browser) for hours and is hard to get back on track. I even tried spinning up two sub-agents but even after years of trying to prompt them, they are almost useless in terms of coding ability, so that is looking to be a waste of spending at least so far but maybe the model will improve as time goes on.
bananadonkey 10 hours ago [-]
My sub agent has been looping for almost 10 years at this point and has so far written 0 lines of code. Definitely won't be investing in another...
HPsquared 14 hours ago [-]
I personally get about 50 tokens per hour.
kordlessagain 5 hours ago [-]
[flagged]
daischsensor 4 hours ago [-]
[flagged]
HardAnchor 3 hours ago [-]
[flagged]
Littice 6 hours ago [-]
[flagged]
arggjarvs 5 hours ago [-]
[flagged]
aplomb1026 10 hours ago [-]
[flagged]
hottrends 8 hours ago [-]
[flagged]
KaiShips 12 hours ago [-]
[flagged]
8 hours ago [-]
codelong888 2 hours ago [-]
[dead]
phlhar 14 hours ago [-]
[dead]
temilson 14 hours ago [-]
[flagged]
eugmai86 12 hours ago [-]
[flagged]
ericmaciver 11 hours ago [-]
[dead]
iluvcommunism 15 hours ago [-]
[dead]
aiexpo_app 2 hours ago [-]
[flagged]
tyingq 12 hours ago [-]
Anyone doing it with a "rent a GPU over the network" path? Is that at all cost effective for any use case?
kertoip_1 15 hours ago [-]
Just attach OpenRouter to your coding agent tool and try yourself. All relevant open weight models are there. Every person have different needs and expectations
dada216 14 hours ago [-]
Local? No.
Via opencode Go subscription using GLM mainly? Yes, I still use Gemini/Claude/GPT via api from openrouter for adjacent tasks, I would say 20$ per month max in api token costs.
Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.
Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200
I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.
It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).
Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)
I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.
And I'm still a AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.
But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.
For other chat tasks and translation, I'll frequently use Gemma 4 31B.
For audio, I'll use Gemma 4 12B.
I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.
The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.
But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinkging, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.
Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, becasue you're not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time.
In my models.ini, I have this for the Qwen3.6 models:
There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.but for caching, all you are doing is leaving off a fraction of the most recent assistant message generation, which will have little/no impact on cache hit rate.
True, but not a tiny fraction, qwen is very verbose in its thinking traces. And it basically means that for every (nonthinking) generated token you have to compute the KV twice (once as tg, the second one as pp).
I'll have to give the preserve_thinking a shot.
Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?
Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.
Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (0(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.
But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.
So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot. However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.
There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.
Some of these issue have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.
https://sebastianraschka.com/llms-from-scratch/ch04/08_delta...
I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)
What does this mean in June 2026 wrt coding?
To me it sounds like being a "rice cooker skeptic". Some people don't like using rice cookers, some do.
I didn't do much benchmarking, but anecdotally, I found it to be making less edit errors. YMMV
And sounds like you haven't factored in the cost of electricity to run that Mac Studio as an LLM machine. Probably get a few more years.
Not everyone can plough $$$$ into hardware right now (more power to those who can), so choosing to rent is an A-Ok strategy.
You can. You just don't want to. Huge difference.
It's what I use. Fixes the problem
https://github.com/day50-dev/petsitter
I find that running better quantization, like Q8 tend to prevent this even though its a bit slower to run, it saves overall time with less churn
Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off
Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but thats literally not what I use these things for!
One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.
There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!
Yeah, that edit inability is weird. I’ve updated AGENTS.md to limit editing (as opposed to rewriting) and that helps a little.
All of these models also seem to get stuck in long thinking loops, sometimes tripling the tokens of a frontier closed model which is really painful when inference is already on the slow side (on my Macbook).
I do (unscientifically) experiment whenever a new capable local LLM (<=130b) releases with a license that permits commercial use. As for knowing my models require more work than Opus, I don't mind still having to puzzle on getting the architecture right. In any case, it forces me to stay in the loop of what's being built, which is a good thing.
https://github.com/DeepBlueDynamics/nemesis8
I'm not familiar with Pi, and not sure which kind of container you are referring to. Something mainstream like docker, or more classic like a BSD jail?
I started to experiment with locale LLMs, through ollama and Lemonade. Enough to throw simple prompts with code excerpts and get small scope code refactors. Though I still struggled to make them work with external tools, like my IDE, so they can be leveraged on to an agentic level with access to a full repository.
That's mainly for work, as they push for using LLMs, though with the new copilote license they provide it doesn't take me even a week to burn the whole token credit.
The tool can be useful, but in my experience without heavy guard rails and loops over tests. I suspect late models to also burn many token into rabbit hole of nonsense hypothesis, instead of doing straight forward correct implemention as you would expect from any entity with such a huge cumulated resources eaten and experimental playground to leverage on. Maybe incentives don't help model provider to minimize sold token, maybe it's just so hard to tame the beast all these bright minds with virtually infinite resources are not good enough.
Anyway, sorry for digression, but I would be extremely interested with a step by step tutorial to make a local LLM work in agentic level, including which kind of hardware is required to make it work properly.
Maybe even more useful than Opus when I have all the constraints to an issue. There is less "knowledge" in the model (I get by with 48GB of RAM allocated to an 8b quant), so it has fewer things to hallucinate about.
I've been getting to know its limits pretty well over the last few weeks and would say it's an excellent code search/replacement/generation* engine.
It's got the "in-context script generation" flow down as well, so it will easily help automate tasks that you describe with text and perhaps example commands, or tools, or skills* that you provide.
*Think of it + Pi as an NLP abstraction layer over grep, or a shell, rather than a jack of all trades + world knowledge all-in-one.
that's why i use the frontier models because its a senior co-worker vs a junior. if you use the junior for the sake of privacy i think you're missing out on the best insights for a specific task.
Consumer-grade subscriptions of the frontier models give you superb capabilities per dollar, them being heavily subsidized. But if you're working in an enterprise setting, that won't work. You need to upgrade, and that gets significantly more expensive.
Furthermore, basing the SDLC on leveraging the bargain subscriptions risks falling apart in the future, both from a cost perspective as well as the question of availability (e.g. Mythos).
So from a strategic perspective, going local on the LLM and still achieving great results with the right approach is very relevant.
https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...
One thing I did change was the context length to 256k rather than 64k.
So there's this really amazing program called "man"
Hold on, what are the specs of your rig? How much RAM?
I've been considering getting an old refurbished 2018 Mac Mini with 64Gb of DDR4 RAM but everything I've read suggests this will be way slower than my 16Gb M1 Pro Macbook.
I've been meaning to write a blog post but well whatever here's the md.
https://gist.github.com/hparadiz/f3596d00a62d8ebb2dadcc46ee5...
Qwen3.5 9B performed best.
You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say a dual 7900XT would get way better performance.
Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.
I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)
It's personal, but I prefer CapEx over OpEx for this. If you can purchase a device upfront that runs a decent local LLM, you get the peace of mind that your setup won't suddenly change over time and can only get better.
Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.
It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.
But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.
Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.
I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.
Claude 4 Opus: https://youtu.be/J7omabtqnBM?t=193
Qwen 3.6 35B A3B: https://youtu.be/gVU-DQeqkI0?t=215
Qwen 3.6 produced far more working functionality than Claude 4 Opus did.
Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.
Will this trend continue? Who knows. Both the frontier and local model will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.
Hoever, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less that $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.
It's likely that there will always be a gap between frontier and local if you're comparing models at the same time, you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.
For web development (or anything else with an extreme amount of training data) it's number one for sure. You can't beat it at its costs. US companies will not be able to compete on a competitive market, which is why they rely on so much US government protection + corporate welfare.
OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.
Anthropic has been releasing models named Opus since 2024 with Claude 3 Opus.
Opus has gotten vastly more capable since then.
Local model far surpass Opus 3. They even surpass Opus 4 on most benchmarks.
Sure, if you compare to the latest Opus 4.8 or even 4.6, they're not there yet. But there's a huge difference in performance between 4 and 4.8.
When I colloquially say Opus level I really mean Opus 4.5 or later
More and more specialized and ultra-performant chips are going to flood the consumer market. Especially once new hardware foundries will start producing (well if we don't die from WW3 in the interval).
In 10 years from now, when even basic computers will have 128 GB of memory, and phones will have super optimized tuned models, then what will be the point of Anthropic ?
Just use Gemma/Gemini/Siri or whatever.
Pornography and uncensored models is also pushing toward local models.
It's not like needs of people grows exponentially, the needs follow an asymptote instead (they are capped).
The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.
For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.
It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).
Thank you.
For the time being, off the top of my head, I'd say:
- Prompt Engineering tips & tricks apply here (like being complete in the relevant context you provide in your question, and the specific task(s) the agent should do like reasoning, modifying one file, or trying to fix a complex task all at once (not recommended)).
- If you already know which files the agent should look into, mention them to save time and potentially context.
- In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.
- It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.
Full octane isn't gonna fit on much of anything south of a 128GB machine once adding KV cache.
[1]: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.
Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context
I'll experiment more with the effectiveness of AGENTS.md rules for local Pi agents. I feel like smaller (local) LLMs just lack in attentiveness to elements in the context window, like precise instructions, compared to e.g. Claude models.
We truly live in the dumbest timeline.
I don't want to be rude, but your linkedin has a sumtotal (generous) of like 8 months of programming as a profession (job title is AI Engineer). The rest is at best programming adjacent. How would you know what either of these situations are really like?
matches my experience and a deal breaker
also the context window sizes are too low. I can't operate in 65,000 windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.
200k context windows and above for me now
I saw a paper last night that should help this a lot though
In Pi, /new is my best friend and most-used command for sure. For simple tasks (I decompose complex ones anyway since I don't trust small local LLMs to do this for me), the model doesn't need much context, given that I'm proficient in my codebase myself: "I'd like Feature X. Look into files 1, 2 and 3 to make your edits."
I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.
I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.
To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.
For my personal needs, free beats $100/m.
I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).
Some example projects
- Replacement launcher for android tvs (with usage monitoring and tracking for kids)
- Custom admin portals for my k8s cluster services
- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)
- Grocery list management and meal planning (mostly via openclaw)
- some custom workflows for 3d asset generation in comfyui.
---
Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.
When I bought, I paid $850 a piece. And I needed one anyways for the gaming I was going to do.
My guess is the next good time to buy is going to be 24-36 months from now, depending on how the AI bubble goes.
---
I'll add to this, I personally don't like Apple hardware (not so much related to the hardware as their company philosophy) but their machines with unified memory (or AMDs latest unified memory offerings) get pretty equivalent speeds to my 3090s, and are probably a much better modern entrypoint to local llms.
There's a reason the joke is that Silicon Valley software devs bought up all the Mac minis for OpenClaw.
You can get a 48gb unified RAM M4 pro mac mini for ~2k. If you're not going to do much else with the machine, it's what I'd pick as my budget inference device right now. Spend a year of claude now, get ~150tok/s for the next decade (plus) for ~free.
If you want more capable and are willing to spend a little more, go with the newer Ryzen AI Max+ 395 machines.
You'll spend less on power too.
My last suggestion would be to go buy an RTX3090 at this point. You can do a lot better for a lot cheaper.
Also, 2 gens old means bad performance at ray tracing, abysmal path tracing if at all. Pretty sure it can't run smoothly CP2077 in native 4k without dlss upscalers with all on ultra.
We should own things, not rent them. We should all do what we can to keep the fabled 2030 agenda at bay.
How do AMD cards perform with LLMs? A 9070 is sold for ~$600 and has 16GB VRAM
16 GiB won't fit you much, so you'd probably want at least 2x, and preferably 3x of those, and then you need a motherboard, power, etc. that can handle that.
Since you're running quantized (at UD-Q4_K_XL) , check out the "qat" models (unsloth/gemma-4-26B-A4B-it-qat-GGUF) !
- https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF (With "Jun 9 Update: Added MTP support.")
- https://blog.google/innovation-and-ai/technology/developers-...
> Quantization-Aware Training (QAT) [...] allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model
I've actually tried this exact same model locally as well.. albeit on just a single 3090 at 128k context and I got around 40-60tok/s with Q4_K quantization.
The thing that bugged me the most was really the quality of the output on moderately complex real-world coding tasks. Having to switch between "prompt/vibe" and "manually implement" is such a big context switch burden, because you really have to ask yourself every few minutes if you're "holding it wrong" or the model is just too stupid.
It also doesn't really seem to handle transitions from "low-level implementation detail" to "high-level design" well, e.g., it wouldn't easily render tables and such. With Claude I don't have this issue.. so I think for now my verdict would be that it's not really a viable replacement. I really hope it will be in a few months time.
Oh and I used "aider" to replace claude CLI, which maybe that's also sub-optimal.. I'm not sure. The MCP marketplaces are useful of course, though arguably you could just manually replace them over time.
It's prone to thinking longer and more repetitively, again - it's definitely not opus 4.7/4.8.
I've been using pi.dev as my harness for it, and been pleasantly surprised by how nice it feels (I have used aider, but only very briefly and quite a while back - so I can't realistically compare).
I would say it's roughly where I felt claude was a year back - Most of the sessions need to be more "pair programming" and less "I let it run for hours".
I'm a big fan of frequent "human in the loop" style workflows even when I'm on something like opus at work, though. I have opinions about lots of things, and re-inforcing that the model should stop and ask frequently seems to get me considerably better output, without having to "re-roll" if you will.
I've done a good bit of management, and I think it's roughly producing what a junior dev might produce in a day every 5 minutes. And just like a junior dev, you need to be steering it back on track fairly often.
Opus feels more like a mid-level at this point. I can hand it a chunk of work and "leave" but I still get better output if I'm checked-in and watching/steering.
I've used Claude Opus to quickly and effectively pound out some 100-200 line scripts that integrate with a vendor's API, and it one-shotted them both almost perfectly.
I wonder if for a lot of these local models, the scope of the AI assistance should simply be smaller: You architect the tools and the function definitions, and then tell AI to implement one at a time? Does anyone do that rigorously?
A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.
Sometimes that matters, a lot of times it doesn't.
On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.
I get very similar quality from the both the Gemma-4 31B dense model, and the Gemma-4 26B MOE model (both at Q4 quant) but the MOE version runs at ~3 times the speed (150tok/s vs 46tok/s).
Other Notes: I have had to set the compact target to 75% on a 256k context window as once the conversation length goes about 100k I start seeing a drop in the quality and speed. This becomes very problematic after about 150k. I tried Qwen 3.5 122b too but it actually seems much worse at coding than 3.6 27b even though its much larger. Maybe because I am using a 4bit quant or maybe I just don't have it configured correctly? I know 3.6 is newer but I didn't expect it to out perform a model that is much larger from the prior generation. Gemma 4 31b is a good model for other tasks but at least my personal experience is that Qwen outperforms in coding. Nemotron Super 120b is great at a lot of stuff but it also seems to be not as good at coding as Qwen. This was very surprising to me.
I have become so "lazy" (in a good way), so far that I've started using the model for lots of daily mundane things on top of just coding:
It feels like anything less than Sonnet is just a waste of time, apart from use as a smarter search function.
It also strikes me as strange that you would mention Codex for UI polish, as it's notoriously bad at UI, and far behind Claude Opus. Altman specifically posted that they are working to improve this for the next model release.
All the drudgery.
I almost find it offensive when colleagues open a MR with an obvious slop description that's frequently inaccurate.
That said, I find AI useful for a lot of drudgery like resolving merge conflicts or splitting changes out into separate MRs.
Particularly with the latter I had issues with small models, they butchered the changes I wanted moved. Not even on the second attempt did GPT 5.4 mini manage to move 10-20 lines to another file without modifying them in the process.
The trade-off of MoE is that it is worse but faster for the same total size.
Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus just not worth it right now. If it was, it would be distributive enough to be in the news.
Not that I'm discounting someone hasn't already solved this, just trying to Occam razor my way out of diving too deep down rabbit holes.
The present Sonnet/Opus versions (~4.8) will likely be what everyone in the enterprise might end up using eventually. And even though local models aren't there yet, there are budget alternatives from the families of DeepSeek, Kimi, GPT, MiniMax, etc. available through APIs of NVidida, OpenRouter, Groq, etc. which are very much Sonnet grade.
Personally, I don't think we're at that point yet. While I do think model improvement is starting to plateau (reaching a local ceiling), I'm not convinced local models are as good as sonnet/opus yet. The gap is still too much. But I'm excited for those models to reach those levels.
It's not really a bitter lesson here, I can scale those 4B models easier than someone can scale their 1000B models.
With a layered approach we can slowly shift to running more locally and still get required work done. Really, my local setup is so much better than it was 2 months ago, and extremely better than 6 months ago - on the same hardware.
If you truly believe that it WILL get there within the next couple of years, then you might as well start playing with it now (and, yes, you will be very surprised, especially for shorter/smaller projects or nicely modularized larger projects)
I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by percieved quality of private model) actually means "better or more useful output."
I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)
What I’m saying is that if local models were actually comparable to Claude Code in practice, we wouldn’t be having threads like this. It would be obvious to the people using them, and it would be massively disruptive. Why would individuals and companies pay hundreds or thousands for Claude Code if they could run something locally and consistently get similar results?
Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.
It's entirely possible Claude is just winning the hype game.
That sounds great for hobbyists but IMHO it wasn't until Opus 4.6 was released six months go (Dec 25, 2025) that we had a model good enough for professionals to use as a primary driver of their coding agents. That seems to be the threshold worth aiming for.
Certainly I get a ton more value out of Opus today, but I could absolutely see someone deciding to limit themselves to 8-to-12-months-ago Opus performance for privacy (or other) reasons.
in my stuff now i use an OT library that claude put finishing touches on in September.
https://www.anthropic.com/news/claude-opus-4-5
Regardless I don't think it's fruitful to be so condescending with such little insight into this person's situation. Even if you had total insight -- let people be and withhold your judgement, or at least keep it to yourself. Making people feel stupid is a great way to turn people off to pretty much anything else you have to say
You must be the type of crowd that writes websites with React and Tailwind and pretend to be engineers and have an opinion on everything.
i always see great debates with local stuff but the space is constantly moving goalposts and all the vernacular is pretty unfamiliar to me. i'd love to understand what people with objective experience feel they've traded away (or gained) when going local so i can determine for myself if these things are a good fit.
(Shouldn't have done that refactoring job in high mode)
> "Quality is like running edge models from 8-12 months ago"
Don't expect Opus, expect more like Haiku. If you micromanage it, you'll get great results. If you want it to be a human in a box, it'll flounder.
I'm looking at https://ollama.com/search and the top few models like kimi-k2.7-code say "cloud" and I can't seem to ollama pull them.
I thought the whole POINT of ollama was not-cloud?
It was at first, then the developers realized they had a massive userbase they could monetize. A tale as old as open source...
A point that I haven't seen come up a lot, but is very valuable to me is that for open source models, I can select the inference provider myself (even if it's not a local GPU), which means that I can enjoy superb speed (i.e. 300 tok/s) while still spending much less than the big providers.
My experience is that if you were fine with the coding models of yesterday (i.e. Claude Opus from Jan/Feb of 2026), you will be fine with either Kimi K2.6 or DeepSeek v4 Pro. Kimi is a bit more smart but has only 256K context and the performance deteriorates (and sometimes just gets stuck) when it fills up the context window. DeepSeek v4 has a 1M context and performs just as well with much less issues. And they both generate very idiomatic code, gives the same vibe of Opus a few months ago.
Since it's also fast (and does not fixate on trying to fix impossible problems, unlike the recent Opus/GPT 5.5 models), a big benefit is that you still control and steer the coding agent and you won't be losing focus like the major models. They are smart, but they don't fixate as much on trying to do stupid things, and since it's fast, you can just interject. It's a much more pleasant experience than the latest models.
I still use the latest models time to time when I expect the agent to fixate all of the problems and figure out everything themselves, but for me open source models are like 80~90% of all of my sessions.
If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.
If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.
The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.
I don't think I'd be using AI to code at all if this weren't the case. (I don't want to feel stunted or stuck just from losing my internet connection.)
I did not expect perfect reliability, but I thought they could at least get it right on the second attempt once you point out the difference. No such luck, it confidently tells you that now the code is the same, with yet another subtle bug added in the difference.
I don't know what work one would need to do where these garbage-class models would be adequate. Maybe they can masquerade as competent for a few minutes, but in the end the results simply are not right. At best they are suitable for a smarter search or autocomplete, in my opinion.
Qwen running on my 1st GPU at q4@176k context from 70 to 50 tok/s with MTP, pretty good for coding.
Gemma on the other hand is using both GPUs, running q8@64k context, doing document sentiment analysis, summarization, proofreading and translating, at consistent 25 tok/s. Somewhat slow but usable for batched workflows. Might get some more once llama.cpp starts supporting MTP with tensor split mode.
Still using frontier LLMs at dayjob since I'm not paying it and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.
EDIT: Prompt processing starts from 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k of tokens and require from 60 to 90 seconds to be processed. Not great but acceptable.
I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.
Where did you find/order these? All the sites I can find are either out of stock, only sell to businesses, or are otherwise sketchy...
No affiliation, I've just ordered from them a few times.
The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.
Electricity-only (@ USD $0.08/kWh)
Total cost of ownership over 3 years is electricity + USD $20K (pre-hike pricing). In a production scenario, how much would I have to charge my users to break even, aiming for 4 concurrent requests 24/7?A) Breakeven API pricing (est. 2B IN + 1B OUT throughput/month):
B) Breakeven subscription (users active ~1.5h/day):There may be a way to get the 2-bit quantized version running even faster on a pair of them.
I have way too much VRAM forme such a model but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible, this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.
And Qwen3.6 has been surprisingly effective for me, I still use Clause occasionally but only for like 10% of my needs which allows me to stay well under the quota even with the cheapest plan.
Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).
I am using Opus to generate plans that the local agent then follows, then validated by Opus. So I'm not at 100% local but these models are increasingly part of my production workflow. Probably not worth doing - yet - unless you are a hobbyist who likes spending time and money tinkering.
This setup is certainly not as "good" as Opus or other frontier models but they are "good enough" for an increasing number of rote tasks. You don't need to drive a Rolls Royce to the supermarket, when a used Corolla gets you there just fine.
It also enables new workflows that would be cost-prohibitive with frontier LLMs (especially as token costs rise) - eg. overnight I use the Chrome devtools MCP and have the above setup fuzz-test as a user for a number of hours and see if it can break things. Even got it working with multi-modal so it can check screenshots, which blows my mind (and not my wallet, as Claude+screenshots burns $$$).
The "12-18 months behind frontier" sounds about right, it's about where I was with gpt-4o and basic harnesses back then. In another 12-18 months my bet is we have Opus-level models that can be run locally for <$5k... but the frontier models will be even further forward (unless governments have blocked them). Fun times.
I find it useful.
This side project highlights a similar approach to how I scope and tackle projects at work now:
https://git.theodohertyfamily.com/wg-wrap.git/tree/README.md
https://git.theodohertyfamily.com/wg-wrap.git/tree/CASE_STUD...
You have to apply a lot of careful architecture and TDD to your approach. Eliminate technical risk by tackling hard things early and wrapping them up in a simple, easy to use interface.
I find I can get some projects done 2-3 times faster than if I wrote them by hand. It can also save about 5-10x time on mundane or broadly scoped projects by helping me consolidate and try out ideas very quickly.
Setup-wise, I switch between vLLM using nvidia/Gemma-4-31B-IT-NVFP4 and llama.cpp using unsloth/gemma-4-31B-it-qat-GGUF with MTP. I throttle the GPU power usage to 400W.
My current llama.cpp setup gets token generation rates between 60-150 t/s depending on MTP draft acceptance rates. Prefill is between 1500-4000 t/s depending on context length/depth.
I've tried other models such as Qwen3.6 35B A3B and I've found that 27B works better for me when it comes to coding. It's slower being a dense model but the quality seems much better. Inference on my system for Qwen3.6 35B A3B is around 130-140 toks/sec, non-MTP, which is insanely fast!
You don't need 4x 5070's to run Qwen3.6 27B, three or maybe even two will work. However, I use MTP (multi-token prediction) to speed up 27B and that eats up more memory because the draft model requires its own context.
Another thing to keep in mind is that the tools you're using have their system prompts that are loaded into the model for each conversation. When I fire up Pi, working with the model is very snappy at start. When I interact with the LLM via Hermes CLI, it's much slower. That's because each prompt with Hermes is loading so much stuff (skills, tools, etc.) into the context and then it's there forever until the conversation ends.
I like running models at home for privacy, but I also like how there are no quotas, usage isn't a worry. If the future is "loop engineering" then you will be burning through tokens and $$$ using a cloud models.
My system idles around 200W and is around 350-450W when inference load is high. Decoding (token generation) isn't all that efficient, and your GPUs sit idle more than you think during inference. Advancements like diffusion may 1) speed up decoding and 2) let you utilize more of your idle GPU.
At first thought, they are quite skewed toward compute (vs VRAM), which is great for gamers but not so great for running LLMs.
(I run a 5070 in my desktop)
Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent they were to only produce only a manifest based on the task, and to pass off that off to the next agent.
In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
Like "The Local AI challenge"
Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.
The tokens/sec may be less but that kind of helps me in going at the right pace. The workflow I use for green field development / rewrites is to pair with Sonnet for design/architecture, reasoning and a detailed execution plan. I then feed this piece by piece with precise prompting and that does the job. For brown field, it is often a judgement call. There are occasions when I have found Local models to be limited in their reach and I resort to Claude Code
Some of my recent work using Qwen 3.6: 1. Complete rewrite of Power management Service in C using the existing C++ code as reference 2. Tool to parse contents from really complex specifications in Excel format 3. Tool to translate CJK contents to english for feeding into KG
It's also annoying that OpenCode doesn't even try to support local LLMs properly.
Getting OpenCode to work is possible, but extremely manual and clunky to configure. I have written a script to automate converting my llama-server configs into an OpenCode config, and that helps, but it's not ideal.
I have seriously considered writing Yet Another Coding Harness in my free time. I have some ideas for what would make it nice.
I've used the cli agents for claude, cursor, and pi, plus several custom harnesses I've written myself from time to time as experiments (and I guess technically gastown, if we're calling that a harness).
Pi is... just fine.
It does what I need it to, has a decent selection of tooling out of the box, integrates nicely with other tools, and generally gets out of my way enough that I don't think about it much anymore.
If you can run ~30b models at decent speeds, I think most folks would be pleasantly surprised at how capable they are with pi.
Tack on some of the extensions (ex https://pi.dev/packages/pi-mcp-adapter?name=mcp and https://pi.dev/packages/pi-web-access?name=search) and I get web tooling (ex - perplexity search), access to mcps to do things like drive chrome (https://browsermcp.io/) or firefox (https://github.com/mozilla/firefox-devtools-mcp)
It's fine. Is it as good as a subsidized top tier model? Nope. Is it free and still very capable? Yup.
And personally, I've been having a LOT of fun with the pi sdk (https://pi.dev/docs/latest/sdk)
Which is something that all the other providers charge you api access rates for (ex - thousands a month).
But yes - it expands a lot if you're willing to play with it.
I'd actually say the vscode comparison is wrong, because vscode is very much "bring your own extension" in the same way that Pi is. While Claude is much more "visual studio" vibes. It's thick, it's opinionated, and it's absolutely not something you can really customize, but it can feel slick for supported workflows.
Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
Like "The Local AI challenge"
but perhaps one individuals prompt feedback just isn't going to ever be enough I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc.. and apparently these end up with bizarre behaviours not necessarily more helpful than the standard models)
I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendancies.
https://cursor.com/blog/real-time-rl-for-composer
About Owain Evans work: I think he did SFT. On Twitter someone was saying that RL is not as susceptible to what he showed. I'd like to try that
The hardware I have (32gb Macs and a gaming PC with 10gb 3080) can only get me to Qwen3.6-35B-A3B at various quants but that’s enough (200-400 PP, 20-30 TG).
It’s taken some time to learn how to best utilize it - some things take a bit of babysitting or direction - but it’s quite useful. Not having ever used CC I can’t compare but it’s been a great assistant or pair programmer for everything from embedded C++ to Vue. I wish I could run 27B as there have been moments when this model feels like it just can’t quite figure something out but those moments are quite rare. For a lot of tasks it’s a huge time saver and has proved super capable at digging into and fixing bugs given pretty vague instructions.
I’m using Pi as my harness.
I don't use local hosted models anymore due to hardware contstraints, but I do have some degree of search anonymisation attached to my OpenCode and OpenRouter connected open models.
On my Macbook I run OrbStack that has the following docker containers set to route through a Mullvad based gluetun.
- Firecrawl - fast web scraping
- SearxNG - metasearch
- CloakBrowser - tursile bypassing Playwright alternative
If you wanted to get fancy with the proxy rotation, you could setup numerous instances of Playwright each with their own Mullvad wireguard key in different locations.
Results depend on the model, of course, and your computer is the limit. Mine wasn't up to the task, unfortunately.
I should mention not to run it at less than q6, I prefer q8.
There's apparently a reason Sonnet and Haiku have been left in previous version #s.
Still encouraging, though, that things are catching up. We can't expect $20k local setups to match $20bn compute clusters.
I'm running this on V100 32GB (~900GB/s memory bandwidth) with 200,000 context window, --spec-type mpt --spec-draft-n-max 3 --spec-draft-n-min 0 --cache-type-k turbo3 --cache-type-v turbo3 to mention most relevant parts.
I usually get somewhere 45-60 t/s. I believe that speed could be improved slightly by switching to ik_llama.cpp fork and Qwen3.6-27B-IQ4_NL.gguf -model but there's no turboquant support and it's with some other tradeoffs too.
One day I thought about how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself. Just think about it.
As a matter of fact, think about these operations, api endpoints, observe their output.
These so called SOTA models are not what meets the eye, and are not at all comparable in the infra department to local models. There is crazy orchestration going on due to the scale of these operations. But also these hard constraints lead to innovation. Innovation nobody speaks about.
I wouldn't say we cannot catchup, but serving our local models through llama, vllm is just the A, B, C of it all. In reality I think what is needed is a replication of said orchestration which I hinted at above.
The SOTA models are a deep orchestration of multiple models operating together it isn't a single model. As such no single model ever will catchup to them until it replicates through training first and then maybe through model architecture this orchestration.
Finally, I would wager that the SOTA "models", as one of these models in this orchestration setup, as served for general consumption, are not so much more capable than qwen 3.6.
I am sure that if you change your perspective you will start noticing the scale of the "magic".
I don't understand, why does it make you think this is the case?
> how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself
Can you give an example?
Sure, connect opencode to an openai/chatgpt endpoint and use it. You will notice multiple "thinking" parts per "turn".
I put all of these in quotation because... they are part of the orchestration game. For example, it is not known if the thinking parts of a particular turn are chain of thought thinking summaries or just plain response which is masquaraded and thus orchestrated into appearing as thinking.
Further notice the cadence, word choice and sentence formation. Notice sentence construction. Notice "thinking part" construction and sequencing.
There is pretty heavy orchestration.
> I don't understand, why does it make you think this is the case?
Because not all tokens are equal. And if you waste expensive tokens on mundane tasks you will go out of business. This is the reason.
As I said, if you observe the output from these api endpoints you will notice it.
I thought that was the code harness simply minifying the outputs. Many models now no longer return the entire chain-of-thought (to avoid distillation attacks). So yes, we don't get the raw LLM output, but I think it's just the thinking summarized, not a complex orchestration or different models.
I do agree though that now cloud models are kind of a black box, that's not only obfuscated but also changes over time. Companies seem to be changing model capabilities without notifying users, or even hiddenly serving completely different models. This is even worse via OpenRouter, with providers serving open-source models, some of them serve heavily quantized versions or even completely different models.
Last time I checked, OpenAI even send (in the response) the summary of the thinking part alreafy in markdown, so opencode has to remove the formatting to format it to their liking.
> Many models now no longer return the entire chain-of-thought (to avoid distillation attacks).
This is what they say: to avoid distillation attacks. And to some large extent this is true. I am saying there is a side- effect and this side- effect (depending on how tin-foilly you want to go) may be either a nice thing to have or it may be the "main reason" for all of this.
The side effect is splicing the inference, brokering requests, and what not, which brings huge benefits at scale.
This was my original point: openweights model to a sota model may be apples to oranges. So when will a local model catchup with its single cot run which is not even shaped properly: well never.
It is apples to oranges.
But what they do not have is the correct shape, the correct approach. This is missing and it shows on multiple scales: it shows in the COT, it shows in the output itself, it shows in the infra to serve the models, it shows in the model orchestration.
This is what anthropic said one year ago:
> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.
Of course, you have to have the right hardware to be able to run with a context window like that, as it takes about 100GB of memory on my DGX Spark to do that with full f16 KV cache on the q4_k_xl model.
It’s slower but you can run them.
I mostly run my MBP on low power even when it is plugged in to avoid the noise and heat. Full power maybe doubles speed but more than doubles power.
What can it do: Simple restructuring of pages. Where did it and other models fail: Splitting up Pinia store which GPT-5.4 did without fail. I think with more tuning, guidance for tool use and maybe some support tooling around it performance can increase further.
Open code against Infomaniak hosted OSS models: Qwen3.5-122B-A10B-FP8, Kimi-K2.6.
I use API keys for billing. It performs like Dec 2025 in terms of my productivity back then.
but then use cheap/local model to implement the specs.
Markdown is more effective at compressing information and fits the context window easier, than hundreds of source code files
but this requires second and third passes, to smooth out the rough edges
has anyone tried that?
harness - pi+custom extension for subagents
model - qwen3.6 35ba3b q4km
hardware - intel arrow lake with 32gb ram
server - llama.cpp vulkan
performance - 15-18t/s generation 50-150t/s pp
planning and task creation is still using claude/gpt but they dont touch the code. All coding is done using this setup.
Example of project made using this setup easyanalytica.com , its of medium size complexity
I mostly use it as a google search if I forget a thing, or doing the boilerplates.
I am using a mix of a non harness chat for the reply speed, and opencode / vim-ai for my boilerplates.
$0.00 / month. That's the budget.
I did try 3.6 on my main desktop. It was good, but I didn't see much differences than coder, so I am still using my old rig.
https://discourse.ubuntu.com/t/use-workshop-to-run-opencode-...
Most small local models don't get tool calling right, however the larger models are now doing this correctly now.
One thing local has not accounted for, is most productive engineers are running multiple cli chats at a time with git worktrees. I normally hover around 3 worktrees + cli-chats.
I considered investing in better hardware but doing the math, it is cheaper for me to pay for DeepSeek (yeah, I know not everyone can do that).
Some of the benchmarks appear to back this up [0]
Of course, a lot depends how you are using it (inference parameters, harness, prompting, etc.), but the model is quite important too.
[0]: https://artificialanalysis.ai/models/open-source/small?model...
It seems pretty intuitive that pouring more resources into a problem (more GPU, bigger GPUs with more VRAM, bigger datasets, better curated datasets, more efficient ways to train, more efficient way to run inference, etc) then running the result for a longer time, with more layers of verification (running in VMs, model fusion comparing multiple models, having harnesses with testing) will at least lead to marginally better results.
Is it worth it and at what pace will it keep on improving are different questions but I have little doubt that if the industry keep on pouring resources, sure more "works".
I've been working on an ops style tool for local LLM inference. Proxying, api keys, request logging, model rewriting and much much more.
https://github.com/ndom91/llama-dash
Also,the lack of enterprise tooling to help selected an appropriate model and tooling to run a local LLM does not help.
[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct
My Homelab AI Dev Platform
https://news.ycombinator.com/item?id=48542433
Hardware:
- GPU: AMD 7900xtx, 24gb vram
- CPU: AMD 5950x, AM4
- RAM: 64gb DDR4 3600
Software:
- OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)
- Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units
- Network: tailscale
- Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)
- LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.
- Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.
Models:
- Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.
- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?
- gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job
Flags (specific for Qwen 27b, since that's primary model):
- `-ngl 99` offload all layers to GPU
- `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing
- `-np 1` single slot (no parallel request handling)
- `--no-context-shift` error instead of silently sliding the context window when full
- `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)
- `-b 2048` logical batch size (tokens per submission)
- `-ub 1024` physical micro-batch (per GPU pass)
- `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling
- `-fa on` flash attention
- `--spec-type draft-mtp` use the model's built-in MTP as the draft model
- `--spec-draft-n-max 3` propose up to 3 draft tokens per step
- `--spec-draft-n-min 0` allow zero drafts if confidence is low
- `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path
- `--reasoning-format deepseek` parse <think> blocks in proper format
- `--chat-template-kwargs '{"enable_thinking": true}'` turns on Qwen's thinking mode on by default (clients can override)
- `--jinja` use the GGUF's Jinja chat template
- `--temp 0.6` moderate randomness (Qwen recommended value for coding)
- `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)
- `--top-k 20` top-20 candidates (Qwen recommended value for coding)
- `--min-p 0.0 disabled (Qwen recommended value for coding)
Performance (27b, primary model):
- ~65t/s for token generation
- ~600 t/s for prompt processing.
- If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.
- ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.
I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.
CLI/Harness:
- Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)
- Headroom (https://github.com/chopratejas/headroom) to maximize the 80k context window
- Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.
A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that its the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.
This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.
Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(
- What "stack" do you recommend? Llama.cpp + OpenCode?
https://medium.com/p/f237d575e861
I'm waiting to swap out my last gen Intel iMac with a new M5 mini of some kind, with the eye to hopefully be able to run some models locally. I envision a mini (heh) arms race to simply swapping out an M(X-1) for an M(X) annually as this field shakes out.
How much does this ware out the hardware?
Also, if privacy is the main reason for running local models, why not use venice.ai and equivalent?
Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.
Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.
Performance as described in the tables in the readme here: https://github.com/antirez/ds4 ...with a bit less than half that at "low power" (30w). Both are usable.
Qwen 3.6 35B-A3B on a Framework 13 with 32GB of memory.
Running llama.cpp, 15 tokens per second. Outputs code and text faster than I can parse.
Nemotron super 3 110B works well for 1M context long vibecoding sessions
I also use Pi harness with no extension
Then I give it to local LLM (eg: Qwen / Gemma 4) via CLI. This is possible through usage of llm-mlx on Mac (or ollama on any machine given sufficient on hardware) which serve OpenAPI endpoints compatible for Aider (CLI) or Visual Studio Code to vibe along with the agentic coding assistant.
The paid products have an advantage but are not necessary if you don't mind to be more-involved with the process and have low expectations.
I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.
For me, the problem with all local LLM-basic coding agents is slow runtime.
[1] https://leanpub.com/read/local-coding-agents
It's kind of like driving a shitbox. It can often drive you from A to B, and some people will try to convince you it's fine. It's not.
There's no logical reason other than absolutely requiring the privacy, doing it for fun, or niche use cases like airplanes and so on. If you can't spend the insanely subsidized $20 for codex, you can use an API for chinese models which will run circles around these tiny models.
Is that characterization based on some objective facts or benchmarks?
I suspect many will realize millions more dollars are being spent than needed to achieve the highest marginal productivity gains, and reallocate accordingly. Who wants more of their money going to developer tooling, rather than bonuses?
That's way more economical and produces far better result than any self hosted models today.
I think it also helps that I'm using my machine to do home server stuff. It excels at all of the traditional workloads. Then I can lean on the AI to help with automation here and there. I find it deeply satisfying.
It's faster than I can read, but it feels slow as hell. I think 40-50 tks is probably much more comfortable and I hope I can reach that when trying this on llamacpp soon enough.
[0] - https://pastes.io/9gaARxE8
[1] - https://jsfiddle.net/pou4nbh9/1/
Model: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gg...
I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem run through by miners and the likes, to build an even bigger rig for parallel execution.
Power usage is also totally not an issue, AI workload is very different from gaming.
tldr llama.cpp-vulkan with opencode on total 48GB VRAM AMD cards on arch btw.
The only reason it’s economical is because it’s massively discounted if you’re not paying API rates.
Sure, you can get the local models to generate plausibly-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.
I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.
Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.
Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.
Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.
(TLDR; Distributed compute for models will require hardware at a level only really possible with data-centers at the moment.)
Token generation operates at such a scale to demand enough from a single GPU as it will often saturate the bandwidth capabilities of consumer grade interconnects like PCIe. Which fundamentally implies that distributing a model's compute across vast distances is too much of a challenge without significant infrastructure.
To give an example, When we split a model's compute between two seperate cards on a single workstation, this doesnt mean we end up with 2x the compute bandwidth for a model. Instead the increase becomes something small like 20% depending on model, because the inconnects (PCIe on consumer hardware) will quickly become so saturated with data being copied between the two GPUs so as to become a bottleneck. And remember that this is something that happens locally with PCIe, which (depending on generation) will cap out at around 20-35 GB/s depending on the generation of motherboard.
Model performance is very much tied to having the fastest and highest bandwidth single card available so as to keep data transfer operations to a minimum as the sheer volume of data necessary for the model to run is immense. I simply cant imagine how slow and unusable a model would be if the copy operations necessary for its compute needed to be performed over unreliable network speeds where there will be significant performance loss as network speeds are not reliably distributed across the globe, and their unreliable nature would demand increased overhead due to data verification.
The dream of distributed AI is a ways off.
You can just about reach the lower end of the latter category with a 128GB machine like a DGX Spark, Framework Desktop, or M5 Max, though those are usually not super fast. For the former category, you can easily run them fast with something like a 3090 or 5090, hell, probably even a 5060 Ti.
For months I spent time curating the AI+harness+skills+MCP servers, but now mainly just code with it. I find myself not bothering to use Claude (but keep paying "just in case").
That's feasible in part because my prompts have very specific objectives, constraints, and suggested staging, because I want the code to be exactly as I would write it, and I want to weigh in at specific moments. I would say the speed-up is 2-4X instead of the 10X of vibe-coding greenfield projects. The problem is not the coding speed, but building something complicated that's also correct and flexible (i.e., a directional accuracy). E.g., the agents help with abandoning a less-fruitful API shape instead of sticking with what works in a local maxima.
One flaw there is that I'm still writing code that feels clean to humans, which now is probably a waste. LLM's might be happier with 10+ parameters on one API instead of a plethora of configuration objects and convenience wrappers.
Albeit I plan to move to local ones when I will get my hands on a 256+ GB macbook.
Local inference is good enough to help me with my daily job, and doesn't turn me into an assistant to the LLM.
The secret to actually good agentic outputs even with small models? Llamacpp has support for this little known sampler called "top-n sigma". You should use that, set it to 1 and set temperature to literally whatever you want (it could be infinity) and your model will just magically work to your maximum context window. That's because long context generation is a sampling problem.
If I give it a page of context, can it write a linked list or identify a bad line of CSS?
Is there anywhere online I can chat with a model I could be running at home to see how good it is?
67M Ouput 51M Input
Total $0.83 dollar.
I honestly don't understand why people just don't use DeepSeek.
if youre shoopping for a new pc, very easy to justify 128gb vram
Recommended setup: plenty of nutrients, some caffeine and a quiet environment.
Performance - not currently measured in tokens: roughly average.
Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.
Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200