GLM-4.7: Advancing the Coding Capability

(z.ai)

173 points | by pretext 4 hours ago

14 comments

jtrn 2 hours ago
My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B/32B active. vLLM/SGLang only supported on the main branch of these engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual English/Chinese primary. Context window: 200k. Claims Claude 3.5 Sonnet/GPT-5 level performance. 716GB in FP16, probably ca 220GB for Q4_K_M.
My most important takeaway is that, in theory, I could get a "relatively" cheap Mac Studio and run this locally, and get usable coding assistance without being dependent on any of the large LLM providers. Maybe utilizing Kimik2 in addition. I like that open-weight models are nipping at the feet of the proprietary models.
[-]
- __natty__ 1 hour ago
  I can imagine someone from the past reading this comment and having a moment of doubt
  [-]
- mft_ 23 minutes ago
  I’m never clear, for these models with only a proportion active (32B here) to what extentt this reduces the RAM a system needs, if at all?
  [-]
  - l9o 19 minutes ago
    RAM requirements stay the same. You need all 358B parameters loaded in memory, as which experts activate depends on each token dynamically. The benefit is compute: only ~32B params participate per forward pass, so you get much faster tok/s than a dense 358B would give you.
  - noahbp 10 minutes ago
    It doesn't reduce the amount of RAM you need at all. It does reduce the amount of VRAM/HBM you need, however, since having all parameters/experts in one pass loaded on your GPU substantially increases token processing and generation speed, even if you have to load different experts for the next pass.
    Technically you don't even need to have enough RAM to load the entire model, as some inference engines allow you to offload some layers to disk. Though even with top of the line SSDs, this won't be ideal unless you can accept very low single-digit token generation rates.
  - deepsquirrelnet 18 minutes ago
    For mixture of experts, it primarily helps with time to first token latency, throughput generation and context length memory usage.
    You still have to have enough RAM/VRAM to load the full parameters, but it scales much better for memory consumed from input context than a dense model of comparable size.
- embedding-shape 2 hours ago
  > Supports tool calling in OpenAI-style format
  So Harmony? Or something older? Since Z.ai also claim the thinking mode does tool calling and reasoning interwoven, would make sense it was straight up OpenAI's Harmony.
  > in theory, I could get a "relatively" cheap Mac Studio and run this locally
  In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
  [-]
  - biddit 2 hours ago
    > In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
    Yes, as someone who spent several thousand $ on a multi-GPU setup, the only reason to run local codegen inference right now is privacy or deep integration with the model itself.
    It’s decidedly more cost efficient to use frontier model APIs. Frontier models trained to work with their tightly-coupled harnesses are worlds ahead of quantized models with generic harnesses.
    [-]
    - theLiminator 1 hour ago
      Yeah, I think without a setup that costs 10k+ you can't even get remotely close in performance to something like claude code with opus 4.5.
      [-]
      - cmrdporcupine 1 hour ago
        10k wouldn't even get you 1/4 of the way there. You couldn't even run this or DeepSeek 3.2 etc for that.
        Esp with RAM prices now spiking.
        [-]
        coder543 1 hour ago
        $10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).
        The point in this thread is that it would likely be too slow due to prompt processing. (M5 Ultra might fix this with the GPU's new neural accelerators.)
        [-]
        embedding-shape 7 minutes ago
        > $10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).
        Please do give that a try and report back the prefill and decode speed. Unfortunately, I think again that what I wrote earlier will apply:
        > In practice, it'll be incredible slow and you'll quickly regret spending that much money on it
        I'd rather place that 10K on a RTX Pro 6000 if I was choosing between them.
        benjiro 54 minutes ago
        > $10k gets you a Mac Studio with 512GB of RAM
        Because Apple has not adjusted their pricing yet for the new ram pricing reality. The moment they do, its not going to be a $10k system anymore but in the $15k+...
        The amount of wafers going to AI is insane and will influence not just memory prices. Do not forget, the only reason why Apple is currently immunity to this, is because they tend to make long term contracts but the moment those expire ... then will push the costs down consumers.
        [-]
        tonyhart7 40 minutes ago
        generous of you to predict apple only make it 50% expensive
  - reissbaker 2 hours ago
    No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...
  - rz2k 1 hour ago
    In practice the 4bit MLX version runs at 20t/s for general chat. Do you consider that too slow for practical use?
    What example tasks would you try?
- reissbaker 2 hours ago
  s/Sonnet 3.5/Sonnet 4.5
  The model output also IMO look significantly more beautiful than GLM-4.6; no doubt in part helped by ample distillation data from the closed-source models. Still, not complaining, I'd much prefer a cheap and open-source model vs. a more-expensive closed-source one.
buppermint 1 hour ago
I've been playing around with this in z-ai and I'm very impressed. For my math/research heavy applications it is up there with GPT-5.2 thinking and Gemini 3 Pro. And its well ahead of K2 thinking and Opus 4.5.
polyrand 31 minutes ago
A few comments mentioning distillation. If you use claude-code with the z.ai coding plan, I think it quickly becomes obvious they did train on other models. Even the "you're absolutely right" was there. But that's ok. The price/performance ratio is unmatched.
Tiberium 2 hours ago
The frontend examples, especially the first one, look uncannily similar to what Gemini 3 Pro usually produces. Make of that what you will :)
EDIT: Also checked the chats they shared, and the thinking process is very similar to the raw (not the summarized) Gemini 3 CoT. All the bold sections, numbered lists. It's a very unique CoT style that only Gemini 3 had before today :)
[-]
- reissbaker 2 hours ago
  I don't mind if they're distilling frontier models to make them cheaper, and open-sourcing the weights!
  [-]
  - Imustaskforhelp 1 hour ago
    Same, although gemini 3 flash already gives a run for the cheaper aspect but a part of me really wants to get open source too because that way if I really want to some day, I can have privacy or get my own hardware to run it
    I genuinely hope that gemini 3 flash gets open sourced but I feel like that can actually crash the AI bubble if something like this happens because I genuinely feel like although there are still some issues of vibing with the overall model itself, I find it very competent overall and fast and I genuinely feel like at this point, there might be some placebo effects too but in reality, the model feels really solid.
    Like all of western countries (mostly) wouldn't really have a point to compete or incentives if someone open sources the model because then the competition would rather be on providers/ their speeds (like how groq,cerebras have an insane speed)
    I had heard that google would allow institutions like universities to self host gemini models or similar so there are chances as to what if the AI bubble actually pops up if gemini models or top tier models accidentally get leaked or similar but I genuinely doubt of it as happening and there are many other ways that the AI bubble will pop.
- ImprobableTruth 44 minutes ago
  How is the raw Gemini 3 CoT accessed? Isn't it hidden?
  [-]
  - Tiberium 7 minutes ago
    There are tricks on the API to get access to the raw Gemini 3 CoT, it's extremely easy compared to getting CoT of GPT-5 (very, very hard).
esafak 2 hours ago
The terminal bench scores look weak but nice otherwise. I hope once the benchmarks are saturated, companies can focus on shrinking the models. Until then, let the games continue.
[-]
- CuriouslyC 2 hours ago
  We're not gonna see significant model shrinkage until the money tap dries up. Between now and then, we'll see new benchmarks/evals that push the holes in model capabilities in cycles as they saturate each new round.
  [-]
  - lanthissa 2 hours ago
    isn't gemini 3 flash already model shrinkage that does well in coding?
    [-]
    - hedgehog 2 hours ago
      Smaller open-weights models are also improving noticeably (like Qwen3 Coder 30B), the improvements are happening at all sizes.
      [-]
      - cmrdporcupine 2 hours ago
        Devstral Small 24b looks promising as something I want to try fine tuning on DSLs, etc. and then embedding in tooling.
    - Imustaskforhelp 1 hour ago
      How much billion parameter model is gemini 3 flash, I can't seem to find info about it online.
- theshrike79 2 hours ago
  z.ai models are crazy cheap. The one year lite plan is like 30€ (on sale though).
  Complete no-brainer to get it as a backup with Crush. I've been using it for read-only analysis and implementing already planned tasks with pretty good results. It has a slight habit of expanding scope without being asked. Sometimes it's a good thing, sometimes it does useless work or messes things up a bit.
  [-]
  - sh3rl0ck 59 minutes ago
    I shifted from Crush to Opencode this week because Crush doesn't seem to be evolving in its utility; having a plan mode, subagents etc seems to not be a thing they're working on at the mo.
    I'd love to hear your insight though, because maybe I just configured things wrong haha
  - maxdo 1 hour ago
    I tried several times . It is no match in my personal experience with Claude models . There’s almost no place for second spot from my point of view . You are doing things for work each bug is hours of work, potentially lost customer etc . Why would you trust your money … just to back up ?
- bigyabai 2 hours ago
  It's a good model, for what it is. Z.ai's big business prop is that you can get Claude Code with their GLM models at much lower prices than what Anthropic charges. This model is going to be great for that agentic coding application.
  [-]
  - maxdo 1 hour ago
    … and wake up every night because you saved a few dollars , there are bugs and they are due to this decision?
    [-]
    - Imustaskforhelp 1 hour ago
      well I feel like all models are converging and maybe claude is good but only time will tell as gemini flash and GLM put pressure on claude/anthropic models
      People (here) are definitely comparing it to sonnet so if you take this stance of saving a few dollars, I am sure that you must be having the same opinion of using opus model and nobody should use sonnet too
      Personally I am interested in open source models because they would be something which would have genuine value and competition after the bubble bursts
desireco42 47 minutes ago
I've been using Z.Ai coding plan for last few months, generally very pleasant experience. I think with GLM-4.6 they had some issues which this corrects.
Overall solid offering, they have MCP you plug into ClaudeCode or OpenCode and it just works.
XCSme 2 hours ago
Funny how they didn't include Gemini 3.0 Pro in the bar chart comparison, considering that it seems to do the best in the table view.
[-]
- jychang 2 hours ago
  Also, funny how they included GPT-5.0 and 5.1 but not 5.2... I'm pretty sure they ran the benchmarks for 5.0, then 5.1 came out, so they ran the benchmarks for 5.1... and then 5.2 came out and they threw their hands up in the air and said "fuck it".
  [-]
  - amelius 25 minutes ago
    after or before running the benchmarks?
  - XCSme 2 hours ago
    I didn't even notice that, I assumed it was the latest GPT version.
- guluarte 1 hour ago
  Gemini is garbage and does it's own thing most of the time ignoring the instructions
gigatexal 2 hours ago
Even if this is one or two iterations behind the big models Claude or openai or Gemini it’s showing large gains. Here’s hoping this gets even better and better and I can run this locally and also that it doesn’t melt my PC.
[-]
- Imustaskforhelp 1 hour ago
  Although one would hope they can run it locally (which I hope so too but I doubt that with the increase of ram prices, I feel like its possible around 2027-2028). but Even if in the meanwhile we can't, I am sure that competition in general (on places like Openrouter and others) would give a meaningful way to cheapen the prices overall even further than the monopolistic ways of claude (let's say).
  It does feel like these models are only behind 6 months tho as many like to say and for some things its 100% reasonable to use it and for some others not so much.
tonyhart7 43 minutes ago
less than 30 bucks for entire year, insanely cheap
(I know that people must pay it on privacy) but still for maybe playing around with still worth it imo
cmrdporcupine 2 hours ago
Running it in Crush right now and so far fairly impressed. It seems roughly in the same zone as Sonnet, but not as good as Opus or GPT 5.2.
larodi 2 hours ago
From my limited exposure to these models, they seem very very very promising.
maxdo 1 hour ago
Funny enough they excluded 4.5 opus :)
observationist 1 hour ago
Grok 4 Heavy wasn't considered in comparisons. Grok meets or exceeds the same benchmarks that Gemini 3 excels at, saturating mmlu, scoring highest on many of the coding specific benchmarks. Overall better than Claude 4.5, in my experience, not just with the benchmarks.
Benchmarks aren't everything, but if you're going to contrast performance against a selection of top models, then pick the top models? I've seen a handful of companies do this, including big labs, where they conveniently leave out significant competitors, and it comes across as insecure and petty.
Claude has better tooling and UX. xAI isn't nearly as focused on the app and the ecosystem of tools around it and so on, so a lot of things end up more or less an afterthought, with nearly all the focus going toward the AI development.
$300/month is a lot, and it's not as fast as other models, so it should be easy to sell GLM as almost as good as the very expensive, slow, Grok Heavy, or so on.
GLM has 128k, grok 4 heavy 256k, etc.
Nitpicking aside, the fact that they've got an open model that is just a smidge less capable than the multibillion dollar state of the art models is fantastic. Should hopefully see GLM 4.7 showing up on the private hosting platforms before long. We're still a year or two from consumer gear starting to get enough memory and power to handle the big models. Prosumer mac rigs can get up there, quantized, but quantized performance is rickety at best, and at that point you look at the costs of self hosting vs private hosts vs $200/$300 a month (+ continual upgrades)
Frontier labs only have a few years left where they can continue to charge a pile for the flagship heavyweight models, I don't think most people will be willing to pay $300 for a 5 or 10% boost over what they can run locally.
[-]
- Alifatisk 23 minutes ago
  In my experience, Grok 4 expert performs way worse then what the benchmarks say.
  I’ve tried it with coding, writing and instructions following. The only thing it excels at currently and searching for things across the web is+ twitter.
  Otherwise, I would never use it for anything else. At coding, it always includes an error, when it patches it, it introduces another one. When writing creative text and had to follow instructions, it hallucinates a lot.
  Based on my experience, I am suspecting XAI for bench-maxing on Artificial Analysis because no way Grok 4 expert performs close to Gpt-5.2, Claude sonnet 4.5 and Gemini 3 pro
- lame-robot-hoax 1 hour ago
  Grok, in my experience, is extremely prone to hallucinations when not used for coding. It will readily claim to have access to internal Slack channels at companies, it will hallucinate scientific papers that do not exist, etc. to back its claims.
  I don’t know if the hallucinations extend to code, but it makes me unwilling to consider using it.
  [-]
  - observationist 46 minutes ago
    Fair - it's gotten significantly better over the last 4 months or so, and hallucinations aren't nearly as bad as they once were. When I was using Heavy, it was excellent at ensuring grounding and factual statements, but it's not worth $100 more than ChatGPT Pro in capabilities or utility. In general, it's about the same as ChatGPT Pro - once every so often I'll have to call out the model making something up, but for the most part they're good at using search tools and ensuring claims get grounding and confirmation.
    I do expect them to pull ahead, given the resources and the allocation of developers at xAI, so maybe at some point it'll be clearly worth paying $300 a month compared to the prices of other flagships. For now, private hosts and ChatGPT Pro are the best bang for your buck.
- kristianp 1 hour ago
  Perhaps people are steering clear of grok due to its extremist political training.
  [-]
  - observationist 57 minutes ago
    This is a silly meme.
    [-]
    - knowsuchagency 48 minutes ago
      Mecha hitler
      [-]
      - observationist 35 minutes ago
        Yes, an adventure in public facing bots that can pull from trending feeds, self referential system prompts, minimal guardrails, and that poor fellow Will Stancil.
        The absence of guard rails is a good thing - what happened with mechahitler was a series of feature rollouts that combined with Pliny trending, resulting in his latest grok jailbreak ending up in the prompt, followed by the trending mechahitler tweets, and so on. They did a whole lot of new things all at once with the public facing bot, and didn't consider unintended consequences.
        I'd rather a company that has a mechahitler incident and laughs it off than a company that pre-emptively clutches pearls on behalf of their customers, or smugly insists that we should just trust them, and that their vision of "safety" is best for everyone.
- claudiug 30 minutes ago
  every time i use grok is get some bad results. basically is all 1000% perfect from his point of view, review the code... "bollocks" methods that dont exists or just one line of code or method created with a nice comment: //#TODO implement