I got their z.ai plan to test alongside my Claude subscription; it feels like it sits somewhere between Sonnet 4.0 and Sonnet 4.5. It's definitely a few steps below current-day Claude, but it's very capable.
Dumbness usually comes from a lack of information; humans are the same way. The difference from other LLMs is that if Opus has the information, it has ridiculously high accuracy on tasks.
z.ai (Zhipu AI) is a Chinese-run entity, so presumably China's National Intelligence Law, put in place in 2017, which requires data exfiltration back to the government, would apply to the use of this. I wouldn't feel comfortable using any service that has that fundamental requirement.
Google, OpenAI, Anthropic and Y Combinator are US-run entities, so presumably the CLOUD Act and FISA, which require data exfiltration back to the government when asked, on top of all the "Room 641A"s where the NSA directly taps into the ISP interconnects, would apply to the use of them. I wouldn't feel comfortable using any service that has that fundamental requirement.
I wouldn't use any provider: z.ai, Claude, OpenAI, ... if I were concerned about the government obtaining my prompts. If you're doing something where this is a legitimate concern (as opposed to my open source stuff), you should get a local LLM or put a lot of effort into anonymizing yourself and your prompts.
I've had the $20/month plan for a few months alongside a max subscription to Claude; the cheap Codex plan goes a really long way. I use it a few times a day for debugging, finding bugs, and reviewing my work. I've run out of usage a couple of times, but only when I lean on it way more than I should.
I only ever use it on the high reasoning mode, for what it's worth. I'm sure it's even less of a problem if you turn it down.
Dafny and similar languages use SMT; their semantics need to be such that you're giving the solver enough information for your proof to verify in reasonable time, otherwise you'll be waiting a very long time, or the proof is effectively undecidable.
I'm not sure about benchmarks comparing languages, but Dafny goes through a lot of tweaking to make the process faster.
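For a rough picture of what's going on under the hood, here's a minimal sketch using Z3's Python bindings (assuming the z3-solver package is installed; the property and variable name are made up for illustration). Dafny compiles each proof obligation into a query roughly like this: to verify a claim, it asks the solver whether the negation has any counterexample.

    # Rough sketch of SMT-backed verification (assumes `pip install z3-solver`).
    from z3 import Int, Solver, Not, Implies, unsat

    x = Int('x')

    # Hypothetical proof obligation: for all x, x > 2 implies x + x > 4.
    claim = Implies(x > 2, x + x > 4)

    s = Solver()
    s.add(Not(claim))  # look for a counterexample to the claim

    # unsat means no counterexample exists, i.e. the obligation holds.
    # Harder obligations (quantifiers, nonlinear arithmetic) are where the
    # solver can churn for a long time or give up with "unknown".
    print("verified" if s.check() == unsat else "not verified")

The "enough information" part is about closing the gap between what you wrote and what the solver can decide on its own; in Dafny that usually means adding asserts, lemmas, or decreases clauses so each obligation stays cheap.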
Is that interesting? Computers accomplish all sorts of tasks that require thinking from humans, without thinking. Chess engines have been much better than me at chess for a long time, but I can't say there's much thinking involved.
I admit that when reading the description of your relationship (I don't mean to be disrespectful, for what it's worth) I can't help but wonder how it can possibly be consistent with "a relationship between two people can be basically whatever they want it to be." It really reads like the relationship is whatever _she_ wants it to be.
If you had come into the relationship with the understanding that you'd both date/have sex with other people, then great; it doesn't matter what other people think. However, when you say that it was hard for you to accept her being with other men, and that you're lucky that "she has never fallen in love and wanted to run away with one of em", damn. My first instinct is that you should take your own advice: find or design a relationship where you don't have to accept this.
I realize that some of my knee-jerk reaction might just be instinct/cultural values; I mean no disrespect.
Calling things "slop" is just begging the question. The real differentiating factor is that, in the past, "human-generated slop" at least took effort to produce. Perhaps, in the process of producing it, the human notices what's happening and reconsiders (or, even better, improves it such that it's no longer "slop"). Claude has no such inhibitions. So, when you look at a big bunch of code that you haven't read yet, are you more or less confident when you find out an LLM wrote it?
If you try to one-shot it, sure. But if you question Claude, point out the error of its ways, tell it to refactor and ultrathink, or point out that two things have similar functionality and could be merged, you get much better results. It can write unhinged code with duplicate, unused variable definitions that don't work, and it'll fix it up if you call it out, or you can just do it yourself. (Cue questions of whether, in that case, it would just be faster to do it yourself.)
I have a Claude max subscription. When I think of bad Claude code, I'm not thinking about unused variable definitions. I'm thinking about the times you turn on ultrathink, allow it to access tools and negotiate its solution, and it still churns out an overcomplicated yet partially correct solution that breaks. I totally trust Claude to fix linting errors.
It's hard to really discuss in the abstract though. Why was the generated code overly complicated? (I mean, I believe you when you say it was, but it doesn't leave much room for discussion.) Similarly, what's partially correct about it? How many additional prompts does it take before you a) use it as a starting point, b) use it because it works, c) don't use any of it and just throw it away, or d) post about why it was lousy to all of the Internet reachable from your local ASN?
I've read your questions a few times and I'm a bit perplexed. What kind of answers are you expecting me to give you here? Surely if you use Claude Code or other tools you'd know that the answers are so varying and situation specific it's not really possible for me to give you solid answers.
However much you're comfortable sharing! Obviously the ideal would be the full source for the "overly complicated" solution, but naturally that's a no-go, so even just more words than the two-word phrase "overly complicated". Was it complicated because it used 17 classes with no inheritance when 5 would have done it? Was it overly complicated because it didn't use functions and so has the same logic implemented in 5 different places?
I'm not asking you, generically, what bad code LLMs produce. It sounds like you used Claude Code in a specific situation and found the generated code lacking. I'm not questioning that it happened to you; I'm curious what specifically was bad about it in your situation, beyond "overly complicated". How was it overly complicated?
Even if you can't answer that, maybe you could help me reword the phrasing of my original comment so it's less perplexing?
You're proposing a truism: if you don't get a good result, it's either because your query is bad or because the LLM isn't good enough to provide a good result.
Yes, that is how this works. I'm talking about the case where you're providing a good query and getting poor results. Claiming that this can be solved by more LLM conversations and ultrathink is cope.
I've claimed neither. I actually prefer restarting or rolling back quickly rather than trying to re-work suboptimal outputs - less chance of being rabbit holed. Just add what I've learned to the original ticket/prompt.
I have pretty much the same amount of confidence when I receive AI-generated or non-AI-generated code to review: my confidence is based on the person guiding the LLM, and their ability to do that.
Much more so than before, I'll comfortably reject a PR that is hard to follow, for any reason, including size. IMHO, the biggest change that LLMs have brought to the table is that clean code and refactoring are no longer expensive, and should no longer be bargained for, neglected or given the lip service that they have received throughout most of my career. Test suites and documentation, too.
(Given the nature of working with LLMs, I also suspect that clean, idiomatic code is more important than ever, since LLMs have presumably been trained on it, but this is just a personal superstition that is probably increasingly false, and also feels harmless.)
The only time I think it is appropriate to land a large amount of code at once is if it is a single act of entirely brain dead refactoring, doing nothing new, such as renaming a single variable across an entire codebase, or moving/breaking/consolidating a single module or file. And there better be tests. Otherwise, get an LLM to break things up and make things easier for me to understand, for crying out loud: there are precious few reasons left not to make reviewing PRs as easy as possible.
So, I posit that the emotional reaction from certain audiences is still the largest, most exhausting difference.
The code I've seen generated by others has been pretty terrible in aggregate, particularly over time as the lack of understanding and coherent thought starts to show. Quite happy without it, thanks; haven't seen it adding value yet.
Or is the bad code you've seen generated by others pretty terrible, but the good code you've seen generated by others blends in as human-written?
My last major PR included a bunch of tests written completely by AI with some minor tweaking by hand, and it was praised with, "love this approach to testing."
I think maybe there's another step too - breaking the design up into small enough pieces that the LLM can follow it, and you can understand the output.
So do all the hard work yourself and let the AI do some of the typing, which you'll then have to spend extra time reviewing closely in case its RNG factor made it change an important detail. And with all the extra up-front design, planning, instructions, and context you need to provide to the LLM, I'm not sure I'm saving on typing. A lot of people recommend going meta and having LLMs generate a good prompt and sequence of steps to follow, but I've only seen that kinda sorta work for the most trivial tasks.
Unless you're doing something fabulously unique (at which point I'm jealous you get to work on such a thing), they're pretty good at cribbing the design of things if it's something that's been well documented online (canonically, a CRUD SaaS app, with minor UI modification to support your chosen niche).
And if you are doing something fabulously unique, the LLM can still write all the code around it, likely help with many of the components, give you at least a first pass at tests, and enable rapid, meaningful refactors after each feature PR.
I don't really understand your point. It reads like you're saying "I like good code, it doesn't matter if it comes from a person or an LLM. If a person is good at using an LLM, it's fine." Sure, but the problem people have with LLMs is their _propensity_ to create slop in comparison to humans. Dismissing other people's observations as purely an emotional reaction just makes it seem like you haven't carefully thought about other people's experiences.
My point is that, if I can do it right, others can too. If someone's LLM is outputting slop, they are obviously doing something different: I'm using the same LLMs.
All the LLM hate here isn't observation; it's sour grapes. Complaining about slop and poor-quality code output is confessing that you haven't taken the time to understand what is reasonable to ask for, and that you aren't educating your junior engineers on how to interact with LLMs.
> Perhaps, in the process of producing it, the human notices what's happening and reconsiders (or even better, improves it such that it's no longer "slop".)
Given the same ridiculously large and complex change: if it is handwritten, only a seriously insensitive and arrogant crackpot could, knowing what's inside, submit it with any expectation that you accept it without a long and painful process, instead of improving it to the best of their ability; on the other hand, using LLM assistance, even a mildly incompetent but valuable colleague or contributor, someone you care about, might underestimate the complexity and cost of what they didn't actually write and believe that there is nothing to improve.
There's probably a difference in degree, however. Alopecia Areata is much less common, while regular male pattern baldness is very common.
There's also the fact that Alopecia Areata is actually more common in women, which I imagine exacerbates the distress compared to the more run-of-the-mill MPB.
I realize you didn't mean to use a study on Alopecia Areata, but the difference in degree could be quite large.
It's also possible that people taking finasteride are a more self-selected group of people who are distressed about hair loss, and are therefore more likely to exhibit depression, etc. As in, if people with androgenetic alopecia are more likely to be depressed, people who take finasteride may be a sampling of those who are distressed enough to seek out and maintain treatment.
Additionally, the kind of person who would reach for prescription medication vs accepting hair loss may be predisposed to depression. I.e. this may be selecting for people who struggle with self-acceptance generally.
I also wonder whether there's some degree of nocebo effect going on. Patients know finasteride is anti-androgenic; perhaps when they inevitably experience some symptoms associated with hypogonadism, they assume the worst and lament the choice between having hair and feeling youthful. This would also explain why many who get off finasteride don't notice their symptoms improve.
Personal bias: I've taken finasteride for years with no side effects.
This is exactly why people thought isotretinoin (brand name Accutane) caused suicides (and why it was kept behind huge access hurdles for years). It turns out that people suffering from physical disfigurements, such as acne, are more prone to suicide than the general population. Not sure if this is also true of androgenetic alopecia, but it would hardly be surprising.
I don't think we're saying different things. People who are distressed about their appearance are more likely to be depressed, people who seek medicine and surgeries are probably more distressed still and therefore more likely to be depressed, and so on.
It did jump out at me that the paper repeatedly cites studies that found a correlation between finasteride and psychological side effects, and then talks about them as though they're evidence of causation.