this one is exciting. It'll enable Ollama on a lot more devices and speed things up - especially AMD GPUs not fully supported by ROCm, Intel GPUs, and iGPUs across different hardware vendors.
I'm looking forward to future ollama releases that might attempt parity with the cloud offerings. I've since moved on to the Ollama compatibility API in KoboldCPP, since they don't have any such limits on their inference server.
In this case, it's not about whether it fits on my physical hardware or not. It's about what seems like an arbitrary restriction designed to start pushing users to their cloud offering.
Z.ai team is awesome and very supportive. I have yet to try synthetic.new. What's the reason for using multiple? Is it mainly to try different models or are you hitting some kind of rate limit / usage limit?
I tried synthetic.new prior to GLM-4.6, starting in August, so I already had a subscription.
When z.ai launched GLM-4.6, I subscribed to their Coding Pro plan. Although I haven't been coding as heavily this month as in the prior two months, I used to hit Claude limits almost daily, often twice a day. That was with both the $20 and $100 plans. I have yet to hit a limit with z.ai, and the server response is at least as good as Claude's.
I mention synthetic.new as it's good to have options and I do appreciate them sponsoring the dev of Octofriend.
z.ai is a Chinese company and I think they host in Singapore. That could be a blocker for some.
I have been subscribing to both Claude and ChatGPT for over two years. I spent several months on Claude's $100 plan and a couple of months on ChatGPT's $200 plan, but otherwise I've been using their $20/month plans.
I cancelled Claude two weeks ago. Pure GLM-4.6 now and a tad of codex with my ChatGPT Pro subscription. I sometimes use ChatGPT for extended research stuff and non-tech.
I was a hardcore Claude fan too, but Sonnet 4.5 + the new weekly limits are really annoying.
I could deal with the limits, but holy shit is Sonnet 4.5 chatty. It produces as much useless crap as Opus 4.1 did. Might feel fun for Vibe Coders when the model pumps out tons of crap, but I want it to do what I asked, not try to get extra credit with "advanced" solutions and 500+ row "reports" after it's done. FFS.
Been testing crush + z.ai GLM 4.6 through OpenRouter (had some credits in there, it seems =) this evening and I'm kinda loving it.
Z.ai is on the US Entity List (banned from export/collaboration):
> “These entities advance the People’s Republic of China’s military modernization through the development and integration of advanced artificial intelligence research. This activity is contrary to the national security and foreign policy interests of the United States under Section 744.11 of the EAR.”
And Microsoft has been instrumental in helping to facilitate Israel's genocide of Palestinian people. Meta / Facebook did it in Myanmar. If you're paying to use any AI product, you're more than likely giving money to companies that either directly or indirectly contribute to genocide.
The difference between Ollama and llama.cpp boils down to "venture-backed product company" vs "community OSS project; creator’s separate company has angel/VC-style pre-seed". I hope even you could squint and see the difference :)
Btw, I feel like it's in somewhat poor taste to comment on something that is effectively a competitor to you (even though you base your own product on it) without disclosing that you work full-time at Ollama Inc. At the very least put the info in your profile.
sorry, I don't use 4chan, so I don't know what's said there.
May I ask what system you are using where you are getting memory estimations wrong? This is an area Ollama has been working on and has improved quite a bit.
The latest version of Ollama is 0.12.5, with a 0.12.6 pre-release available.
I recently tested every version from 0.7 to 0.11.1 trying to run q5 mistral-3.1 on a system with 48GB of available VRAM across 2 GPUs. Everything past 0.7.0 gave me OOM or other errors. Now that I've migrated back to llama.cpp I'm not particularly interested in fucking around with ollama again.
as for 4chan, they've hated ollama for a long time because they built on top of llama.cpp and then didn't contribute upstream or give credit to the original project
I'm hopeful that in the future, more and more model providers will help optimize for given model quantizations - 4 bit (e.g. NVFP4, MXFP4), 8 bit, and a 'full' precision model.
Yeah, I think the idea that models that don't come from ollama.com are second-class citizens was what first made me start to think about migrating back to llama.cpp, and then the memory stuff was the straw that broke the camel's back. I don't want to use a project that editorializes about what models and quants I should be using; if I wanted a product I don't have control over, I'd just use a commercial provider. For completeness' sake, I actually did download the full fp16 and quantize it using ollama, and I still hit the memory error.
I truly don't understand the reasoning behind removing support for all the other quants; it's really baffling to me considering how much more useful running a 70B model at q3 is than not being able to run a 70B model at all. Not to mention forcing me to download hundreds of gigabytes of fp16 because compatibility with other quants is apparently broken, and forcing me to quantize models myself.
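Rough weights-only math makes the point. This is a back-of-the-envelope sketch: the bits-per-weight figures are ballpark assumptions for common GGUF quants, and it ignores KV cache and runtime overhead, which add several more GB on top:

    # Weights-only VRAM estimate for a dense model. Bits-per-weight values
    # below are ballpark assumptions for common GGUF quants; real files vary
    # per tensor, and KV cache / overhead add several GB on top.
    GIB = 1024 ** 3

    def weight_gib(params_billion: float, bits_per_weight: float) -> float:
        """Approximate weight memory in GiB."""
        return params_billion * 1e9 * bits_per_weight / 8 / GIB

    cases = [
        ("70B @ FP16",   70, 16.0),  # full precision
        ("70B @ Q5_K_M", 70, 5.5),   # ~5.5 bpw ballpark
        ("70B @ Q3_K_M", 70, 3.9),   # ~3.9 bpw ballpark
    ]
    for label, params_b, bpw in cases:
        print(f"{label:>14}: ~{weight_gib(params_b, bpw):5.1f} GiB of weights")

    # Approximate output:
    #     70B @ FP16: ~130.4 GiB of weights
    #   70B @ Q5_K_M: ~ 44.8 GiB of weights
    #   70B @ Q3_K_M: ~ 31.8 GiB of weights

Once context and overhead are added, only the Q3-level quant has any realistic chance of fitting on a 48GB setup, and FP16 isn't close.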
Yeah, a lot of these corporate hackathons are basically just lead gen in disguise. "Use our SaaS product, maybe we’ll give you a t-shirt." They're more about getting conversions than actually teaching anything useful to the students.
Sorry about this. We are working really hard on providing usage-based pricing.
During the preview period we want to start offering a $20/month plan tailored for individuals. We are monitoring usage and making changes as people hit rate limits so we can satisfy most use cases and be generous.
https://github.com/21st-dev/1code