> "needs hardware support to be performant and secure at the same time"
So does Poly1305; it just so happens that most popular processors have strong hardware support. Here's an exercise: implement both GHASH and Poly1305 for MSP430.
I think you're calling fast multipliers "hardware support", which is fair, but the hardware support needed by GHASH is idiosyncratic to things like GHASH. CLMUL is only a few years old and GCM is its primary use case.
Implementing a truly constant-time GCM in software without CLMUL is sufficiently hard that noone has managed to create a remotely competitive implementation. They're all either an order of magnitude slower or vulnerable to cache-timing attacks.
Poly1305 isn't a walk in the park, but doesn't need special hardware support for fast constant-time implementation. Though I will agree something like HMAC is much simpler.
So does Poly1305; it just so happens that most popular processors have strong hardware support. Here's an exercise: implement both GHASH and Poly1305 for MSP430.