Hacker News | DRMacIver's comments

> To all you amateur Hegel enthusiasts out there: there is no synthesis in Hegel.

Looks like the mods deleted the last long thread about this, so best not to relitigate, but short version: Yes, we know. We liked the name and thought it was funny so we kept it.

> Otherwise: Congratulations on the QuickCheck-style testing in Rust. At work, I’m always surprised that property-based testing is so little known and so rarely used outside of functional programming.

Actually, it's Hypothesis-style testing in Rust. There was already a QuickCheck-style library for Rust.

Property-based testing is in fact far more widely used in Python than in functional programming (probably not as a percentage of users, but in raw numbers), something I'm always surprised the functional programming community seems mostly unaware of.


> We liked the name and thought it was funny so we kept it.

It is funny, and I really like the reference.

> ..., something I'm always surprised the functional programming community seems mostly unaware of.

Oh, I should have clicked on the Hypothesis link in the first paragraph. Thanks for pointing that out!

Edit: And it makes me smile that there was a long thread about it.


What do you think we're currently missing that Python's `from_type` has? I actually think the auto-deriving stuff we currently have in Rust is as good as or better than from_type (e.g. it gets you the builder methods and has support for enums), but I've never been a heavy from_type user.

`from_type` just supports a bunch more things than Rust ever can, due to the flexibility of Python's type system. `from_type(object)` is amazing, for example, and not something we can write in Rust.

Yeah, that's true. I was going to say that it's maybe not fair to count things that just don't even make sense in Rust, but I guess the logical analogue is something like `Box<dyn MyTrait>` which it would make sense to have a default generator for but also we're totally not going to support that.
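
(For anyone who hasn't used it, here's a minimal sketch of what `from_type` looks like on the Python/Hypothesis side. The `Order` dataclass is a made-up example; this isn't Hegel's Rust API.)

    from dataclasses import dataclass
    from hypothesis import given, strategies as st

    @dataclass
    class Order:
        item_id: int
        quantity: int

    # from_type builds a strategy purely from the type annotations;
    # from_type(object), mentioned above, is the flexible extreme and
    # draws from essentially any type Hypothesis knows how to build.
    @given(st.from_type(Order))
    def test_orders_roundtrip(order):
        assert order == Order(order.item_id, order.quantity)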

Please let us know how it goes!

As Liam says, the derive generator is not very well dogfooded at present. The Claude skill is a bit better, but we've only been through a few iterations of using it and getting Claude to improve it, and porting from proptest is one of the less well tested areas (because we don't use proptest much ourselves).

I expect all of this works, but I'd like to know ways that it works less well than it could. Or, you know, to bask in the glow of praise of it working perfectly if that turns out to be an option.



The short answer to how it fits into existing ecosystems is... in competition I suppose. We've got a lot of respect for the people working on these libraries, but we think the Hypothesis-based approach is better than the various approaches people have adopted. I don't love that the natural languages for us to start with are ones where there are already pretty good property-based testing libraries whose toes we're stepping on, but it ended up being the right choice because those are the languages people care about writing correct software in, and also the ones we most want the tools in ourselves!

I think right now if you're a happy proptest user it's probably not clear that you should switch to Hegel. I'd love to hear about people trying, but I can't, hand on heart, say that it's clearly the correct thing for you to do given its early state, even though I believe it will eventually be.

But roughly the things that I think are clearly better about the Hegel approach and why it might be worth trying Hegel if you're starting greenfield are:

* Much better generator language than proptest (I really dislike proptest's choices here. This is partly personal aesthetic preference, but I do think explicitly constructed generators work better as an approach, and I think this has been borne out in Hypothesis). Hegel has a lot of flexible tooling for generating the data you want.

* Hegel gets you great shrinking out of the box that always respects the validity requirements of your data. If you've written a generator to always ensure something is true, that should also be true of your shrunk data (see the sketch below). This is... only kind of true in proptest at best. It's not got quite as many footguns in this space as the original QuickCheck and its purely type-based shrinking, but you will often end up having to make a choice between shrinking that produces good results and shrinking that you're sure will give you valid data.

* Hegel's test replay is much better than seed saving. If you have a failing test and you rerun it, it will almost immediately fail again in exactly the same way. With approaches that don't use the Hypothesis model, the best you can hope for is to save a random seed, then rerun shrinking from that failing example, which is a lot slower.

There are probably a bunch of other quality of life improvements, but these are the things that have stood out to me when I've used proptest, and are in general the big contrast between the Hypothesis model and the more classic QuickCheck-derived ones.
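
(To make the validity point concrete, here's a minimal Python/Hypothesis sketch with a deliberately silly property; the Rust syntax differs. The generator guarantees sorted input, and because shrinking replays through the same generator, any shrunk counterexample is still sorted.)

    from hypothesis import given, strategies as st

    # The generator itself enforces the invariant: every list it produces is sorted.
    sorted_lists = st.lists(st.integers()).map(sorted)

    @given(sorted_lists)
    def test_append_keeps_list_sorted(xs):
        xs.append(0)             # wrong in general; fails for e.g. [1]
        assert xs == sorted(xs)  # the shrunk failing input is still a valid sorted list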


This is helpful, thanks!

TBF PBT has been the present in Python for a while now.

10 years ago might have been a little early (Hypothesis 1.0 came out 11 years ago this coming Thursday), but we had pretty wide adoption by year two and it's only been growing. It's just that the other languages have all lagged behind.

It's by no means universally adopted, but it's not a weird rare thing that nobody has heard of.


> But the problem remains verifying that the tests actually test what they're supposed to.

Definitely. It's a lot harder to fake this with PBT than with example-based testing, but you can still write bad property-based tests and agents are pretty good at doing so.

I have generally found that agents with property-based tests are much better at not lying to themselves about it than agents with just example-based testing, but I still spend a lot of time yelling at Claude.

> So "a huge part" - possibly, but there are other huge parts still missing.

No argument here. We're not claiming to solve agentic coding. We're just testing people doing testing things, and we think that good testing tools are extra important in an agentic world.


A fun recent experience I had with Claude: I asked it to write a model for PBTs against a complex SUT, and it duplicated the SUT algorithm in the model — not helpful! I had to explicitly prompt it to write the model algorithm in a completely different style.
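
(A minimal Python/Hypothesis sketch of what that looks like when it goes right; `fast_unique_count` is a made-up stand-in for the optimised SUT, and the model is a deliberately naive scan rather than a copy of the same algorithm.)

    from hypothesis import given, strategies as st

    def fast_unique_count(xs):
        # stand-in for the optimised system under test
        return len(set(xs))

    @given(st.lists(st.integers()))
    def test_matches_naive_model(xs):
        # model written in a completely different style: an O(n^2) membership scan
        seen = []
        for x in xs:
            if x not in seen:
                seen.append(x)
        assert fast_unique_count(xs) == len(seen)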

Ugh, yeah. Duplicating the code under test is a bad habit that Claude has had when writing property-based tests from very early on and has never completely gone away.

Hmm, now that you mention it, we should add some instructions not to do that in the hegel-skill, though oddly I've not seen it doing it so far.


> I have generally found that agents with property-based tests are much better at not lying to themselves

I also observed the cheating to increase. I recently tried to do a specific optimization on a big complex function. Wrote a PBT that checks that the original function returns the same values as the optimized function on all inputs. I also tracked the runtime to confirm that performance improved. Then I let Claude loose. The PBT was great at spotting edge cases but eventually Claude always started cheating: it modified the test, it modified the original function, it implemented other (easier) optimizations, ...


Ouch. Classic Claude. It does tend to cheat when it gets stuck, and I've had some success with stricter harnesses, reflection prompts and getting it to redo work when it notices it's cheated, but it's definitely not a solved problem.

My guess is that you wouldn't have had a better time without PBT here and it would still have either cheated or claimed victory incorrectly, but definitely agreed that PBT can't fully fix the problem, especially if it's PBT that the agent is allowed to modify. I've still anecdotally found that the results are better than without it because even if agents will often cheat when problems are pointed out, they'll definitely cheat if problems aren't pointed out.


> We're not claiming to solve agentic coding. We're just testing people doing testing things, and we think that good testing tools are extra important in an agentic world.

Yeah, I know. Just an opportunity to talk about some of the delusions we're hearing from the "CEO class". Keep up the good work!


Conversation with Will (Antithesis CEO) a couple months ago, heavily paraphrased:

Will: "Apparently Hegel actually hated the whole Hegelian dialectic and it's falsely attributed to him."

Me: "Oh, hm. But the name is funny and I'm attached to it now. How much of a problem is that?"

Will: "Well someone will definitely complain about it on hacker news."

Me: "That's true. Is that a problem?"

Will: "No, probably not."

(Which is to say: You're entirely right. But we thought the name was funny so we kept it. Sorry for the philosophical inaccuracy)


If I had been wearing my fiendish CEO hat at the time, I might have even said something like: "somebody pointing this out will be a great way to jumpstart discussion in the comments."

One of the evilest tricks in marketing to developers is to ensure your post contains one small inaccuracy so somebody gets nerdsniped... not that I have ever done that.


A sort of broadening of Cunningham's Law (the fastest way to get an answer online is not by posting the question, but by posting the wrong answer—very true in my experience). If there's no issue of fact at hand, then you end up getting some engagement about the intentional malapropism/misattribution/mistake/whatever and then the forum rules tend to herd participants back to discussing the matter at hand: your company.

https://meta.wikimedia.org/wiki/Cunningham%27s_Law


Seth Godin made the case that it's more important for people to make remarks than to be favorable (https://en.wikipedia.org/wiki/Purple_Cow:_Transform_Your_Bus...)

Trump did this a lot with the legacy media in his first term. He would make inaccurate statements to the media on whatever topic he wanted in the spotlight, and the media would jump to "fact check" him. Guess what: now everyone is talking about illegal immigration, tariffs, or whatever subject Trump thought was to his advantage.


"No such thing as bad publicity" is a very old idea. That quote is usually attributed to PT Barnum, but the idea is much older than him.

People always need to be reminded, though. It seems to be in human nature to fear bad publicity, and the people who fear it less end up with disproportionate power as a result.

If that's not motivation enough for you to rename it, well, JavaScript already has a static type checker called Hegel. https://hegel.js.org/ (It's a stronger type system than TypeScript.)

We looked at it and given that the repo was archived nearly two years ago decided it wasn't a problem.

I think it's more that Hegel was fine with "dialectics" but that the antithesis/synthesis stuff is not actually what's going on in his dialectic. It's a bit of a popular misconception about the role of negation and "movement" in Hegel.

I believe (unless my memory is broken) they get into this a bunch in Ep 15 of my favourite podcast "What's Left Of Philosophy": https://podcasts.apple.com/gb/podcast/15-what-is-dialectics-...

Also if you're not being complained about on HN, are you even really nerd-ing?


Post author here btw, happy to take questions, whether they're about Hegel in particular, property-based testing in general, or some variant on "WTF do you mean you wrote rust bindings to a python library?"

One thing I'm curious about, which I couldn't figure out from a skim of your post, is whether the generated test inputs are random, sequential, or adversarial.

IIRC there are fuzz testers that will analyze the branches of the code to look for edge cases that might break it -- that seems like something that would be wonderful to have in a property tester, but it also seems very difficult to do, especially in a language agnostic way.

How long does it take to find breaking cases like "0/0" or "ß"? Do they pop up immediately, or does it only happen after hundreds or thousands of runs?


They're random, but with a lot of tweaks to the distribution that make weird edge cases pop up with fairly high probability, and with some degree of internal mutation, followed by shrinking to turn them into nice tidy test cases. In Python we do a little bit of code analysis to find interesting constants, but Hegel doesn't do that; it's just tuned to common edge cases.

I think all the examples I had in the post are typically found in the first 100 test cases and reliably found in the first 1000, but I wouldn't swear that that's the case without double checking.
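
(For a concrete sense of what that looks like on the Python/Hypothesis side, here's a minimal sketch; the timings are the rough ones described above, not guarantees.)

    from hypothesis import given, strategies as st

    @given(st.text())
    def test_upper_preserves_length(s):
        # fails for inputs like "ß", whose uppercase form is "SS";
        # the text strategy is biased towards awkward codepoints,
        # so this typically trips within the first few hundred examples
        assert len(s.upper()) == len(s)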

We don't do any coverage guidance in Hegel or Hypothesis, because for unit-testing-style workflows it's rarely worth it - it's very hard to do good coverage guidance in under, like, 10k test runs at a minimum; 100k is more likely. You don't have enough time to get really good at exploring the state space, and you haven't hit the point where pure random testing has exhausted itself enough that you have to do something smarter to win.

It's been a long-standing desire of mine to figure out a way to use coverage to do better even on short runs, and there are some kinda neat things you can do with it, but we've not found anything really compelling.


You mention in the post that there are design differences between Hegel/Hypothesis and QuickCheck, partly due to attitude differences between Python/non-Haskell programmers and Haskell programmers. As someone coming from the Haskell world (though by no means considering Haskell a perfect language), could you expand on what kinds of differences these are?

So I think a short list of big API differences are something like:

* Hypothesis/Hegel are very much focused on using test assertions rather than a single property that can be true or false. This naturally drives a style that is much more like "normal" testing, but also has the advantage that you can distinguish between different types of failing test. We don't go too hard on this, but both Hegel and Hypothesis will report multiple distinct failures if your test can fail in multiple ways.

* Hegelothesis's data generation, and the way it interacts with testing, is much more flexible and basically fully imperative. You can generate whatever data you like wherever in your test you like, freely interleaving data generation and test execution (see the sketch after this list).

* QuickCheck is very much type-first, with explicit generators as an afterthought. I think this is mostly a mistake even in Haskell, but in languages where "just wrap your thing in a newtype and define a custom implementation for it" will get you a "did you just tell me to go fuck myself?" response, it's a nonstarter. Hygel is generator first, and you can get the default generator for a type if you want, but it's mostly a convenience function with the assumption that you're going to want a real generator specification at some point soon.
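
(A minimal Python/Hypothesis sketch of the interleaved style from the second bullet; the Rust surface syntax differs, but the shape is the same.)

    from hypothesis import given, strategies as st

    @given(st.data())
    def test_indexing_stays_in_bounds(data):
        xs = data.draw(st.lists(st.integers(), min_size=1))
        # this draw depends on data generated earlier in the same test
        i = data.draw(st.integers(min_value=0, max_value=len(xs) - 1))
        assert xs[i] in xs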

From an implementation point of view, and what enables the big conveniences, Hypothesis has a uniform underlying representation of test cases and does all its operations on them. This means you get:

* Test caching (if you rerun a failing test, it will immediately fail in the same way with the previously shrunk example)

* Validity guarantees on shrinking (your shrunk test case will always be one your generators could have produced. It's a huge footgun in QuickCheck that you can shrink to an invalid test case)

* Automatically improving the quality of your generators, never having to write your own shrinkers, and a whole bunch of other quality of life improvements that the universal representation lets us implement once and users don't have to care about.

The validity thing in particular is a huge pain point for a lot of users of PBT, and is what drove a lot of the core Hypothesis model to make sure that this problem could never happen.

The test caching is because I personally hated rerunning tests and not knowing whether it was just a coincidence that they were passing this time or that the test case had changed.


I'd love to see all this integrated with mutation testing, the thing being looked for being that the test input kills the mutant.

TBH reading the first few words of that section I was definitely expecting it to continue "so we used Claude to rewrite Hypothesis in Rust..." so that was quite a surprise!

It's on the agenda! We definitely want to rewrite the Hegel core server in rust, but not as much as we wanted to get it working well first.

My personal hope is that we can port most of the Hypothesis test suite to hegel-rust, then point Claude at all the relevant code and tell it to write us a hegel-core in rust with that as its test harness. Liam thinks this isn't going to work, I think it's like... 90% likely to get us close enough to working that we can carry it over the finish line. It's not a small project though. There are a lot of fiddly bits in Hypothesis, and the last time I tried to get Claude to port it to Rust the result was better than I expected but still not good enough to use.


To put it on the record: my position is that current models can't get us there, and neither can the next iteration of models, but in two model iterations this will be worth doing. There are a lot of fiddly details in Hypothesis that are critical to get right. You can get a plausible 80% port with agents today but find they've structured it in a way that makes it impossible to get to 100%.

Not really a question. Just wanted to express my gratitude for Hypothesis. I use it regularly. A few years back, I had to build a semi-formally verified fund and account management service, and used the state-based testing of Hypothesis to validate its correctness. Cannot express how invaluable this little framework has been.

A little while after that, I spoke to someone in the pharma-adjacent-space who was looking at Antithesis to validate their product. At the time, Antithesis (the company) told him that it was a bad fit. I suggested something akin to my previous approach (which did not include antithesis). No clue what they ended up doing, but it is nice to see that Hypothesis and Antithesis have finally joined forces.


You're very welcome! I'm glad it's been useful for you.

Why would I use this over the existing Proptest library in Rust?


How popular do you want it to be?

The Python survey data (https://lp.jetbrains.com/python-developers-survey-2024/) holds pretty consistently at 4% of Python users saying they use it, which isn't as large as I'd like, but given that only 64% of people in the survey say they use testing at all, it isn't doing too badly, and I think it certainly falsifies the claim that Python programs don't have properties you can test.


FWIW shrinkray gets to 162 bytes if I leave it to run for about 10 minutes. https://gist.github.com/DRMacIver/ee025c90b4867125b382a13aaa...

I think it might do a little but not a lot better if I left it to run for longer, but it looked to have mostly stalled by then, so I got bored and killed it.


I actually tried to Shrinkray it myself, but after ~3 minutes it crashed with the following stack trace:

      (4 nested BaseExceptionGroup: Exceptions from Trio nursery (1 sub-exception), and then:
      +-+---------------- 1 ----------------
        | Traceback (most recent call last):
        |   File "/nix/store/5cdw4sa0n0lwym47rd99a1q6b4dj7nr9-python3.12-shrinkray-0.0.0/lib/python3.12/site-packages/shrinkray/work.py", line 221, in parallel_map
        |     yield receive_out_values
        |   File "/nix/store/5cdw4sa0n0lwym47rd99a1q6b4dj7nr9-python3.12-shrinkray-0.0.0/lib/python3.12/site-packages/shrinkray/work.py", line 71, in map
        |     yield v
        | GeneratorExit
        +------------------------------------
Not sure if this is a known issue? Note that I ended up not respecting all of the version requirements in the pyproject.toml since I tried to run it all in Nix and there the Poetry support is currently quite lacking.


Interesting. I've not seen that before. My best guess is that it might be an old trio version or something, but I'm not sure.

