I'm not sure if you understood the author correctly - your points don't disagree with him. He is not talking about using hashes for identity, but for routing. And he is advocating something more general than what you said. You said, basically, "implement your own hash functions for your objects." He said, basically, "don't use Object.hashCode(); you can easily just hash the serialization of the object, or implement a hash function yourself."
Except he cited specific examples of distributed systems (Hadoop, Storm) that use hashCode() but also make it quite clear you need to ensure you provide a hashCode() implementation that is entirely derived from the state of the object which is preserved in serialized form.
No, not need to. He said it was convenient if you're already serializing the object anyway. Which, if you're routing an object in a distributed system, is probably true.
Yes, it is terribly convenient because if Type.hashCode() has problems with copies of objects, this will help:
int computeHash(byte[] serializedBytes) {
    // Arrays inherit the identity-based Object.hashCode().
    return serializedBytes.hashCode();
}
Now you've gone from a hashCode() method that won't work if the programmer screwed the pooch, to a hashCode() method that won't work period. ;-)
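The joke works because Java arrays inherit the identity-based Object.hashCode(). A minimal sketch of a content-based alternative follows, assuming the serialized form is already available as a byte[]; the class and method names (SerializedHash, contentHash) are made up for illustration and come from neither the article nor any framework.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SerializedHash {
    // Content-based: Arrays.hashCode(byte[]) depends only on the bytes,
    // so equal serialized forms always produce the same value.
    static int contentHash(byte[] serializedBytes) {
        return Arrays.hashCode(serializedBytes);
    }

    public static void main(String[] args) {
        byte[] a = "key-42".getBytes(StandardCharsets.UTF_8);
        byte[] b = a.clone();
        // Identity-based hash of two distinct array objects.
        System.out.println(a.hashCode() == b.hashCode());
        // Content-based hash of the same two arrays.
        System.out.println(contentHash(a) == contentHash(b));
    }
}
```

Arrays.hashCode uses the same polynomial scheme as List.hashCode, so it is fine for partitioning but is not collision-resistant.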
[Yes, it won't be screwed up if you also use your own hash.]
The problem with serialization is you lose the semantics of the object. The "same values to the same sequence of bytes" is a tighter constraint than you might think. The same "values" might have multiple representations which are logically identical but might still be relevant in any number of ways (for example, additional information that has nothing to do with identity).
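A small sketch of the "multiple representations" problem, under the assumption of a naive wire format that writes map entries in iteration order. The format and the names SerializedVsLogical, serialize, and ordered are hypothetical and only serve the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class SerializedVsLogical {
    // Hypothetical wire format: size, then each entry in iteration order.
    static byte[] serialize(Map<String, Integer> map) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(map.size());
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Builds a two-entry map whose iteration order is the insertion order.
    static Map<String, Integer> ordered(String k1, int v1, String k2, int v2) {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put(k1, v1);
        map.put(k2, v2);
        return map;
    }
}
```

With this format, ordered("a", 1, "b", 2) and ordered("b", 2, "a", 1) are equal according to Map.equals and share a hashCode(), yet they serialize to different bytes, so a hash of the bytes would route them differently.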
> [Yes, it won't be screwed up if you also use your own hash.]
It's strange that you disregard that in an aside, as it's his entire point.
Personally, I prefer to implement hashes on the values inside the object, and not its serialization. They are unrelated concepts, and it helps to keep the concepts separate in the implementation. But for the purposes of routing in a distributed system - which is what the post is about - it's an acceptable solution.
> It's strange that you disregard that in an aside, as it's his entire point.
Perhaps I misunderstood as to what his point was.
If it is his entire point, I wish he hadn't added the bit about serialization. It just confuses the point.
To be honest, the point of the article strikes me as confused, because he's running the gamut of topics even in his summary at the end. Maybe you can help me understand my reading comprehension problem. Looking at the summary I see a variety of points that aren't really related:
1) Don't use the implementation of Object.hashCode() for your distributed hash table. Okay, this should not be news to anyone who is at all proficient with Java. It also doesn't require a long article. I'm not sure if another article needs to be written. If one must, you can just say, "Object.equals() doesn't provide magical equivalence powers; it's a placeholder for when you implement equivalence. BTW, as you'd expect, the Java standard library documents its equivalence semantics whenever it overrides the method itself." You don't even have to mention hashCode().
2) Your serialization framework ought to be how you implement equivalence. Umm, no. That couples the concept of serialization with that of equivalence. It is essentially arguing that you ought to implement equals() by first serializing both objects and then comparing their output! I'd argue they are related in the same manner as the article's description of the relationship between Object.hashCode()'s implementation and good distributed partitioning. Serialized equivalence is correlated with logical equivalence just enough to create a huge mess. Hadoop wanders about as close to this as one should by providing hooks for doing comparisons of objects without deserializing them. One should really, really NOT go down this road.
3) Since Object.hashCode()'s implementation doesn't provide equivalence, you should not use overrides of that implementation for your partitioning. That's just stupid, even beyond the confusion of interface and implementation. You're going to need equivalence logic for your keys. Are you really going to create your own "T.equivalent(T)" method whose behaviour is distinct from "T.equals(T)"? What does it mean when those two don't have the same value? This suggests a lack of understanding as to why Object.hashCode() is there in the first place.
4) Your distributed system needs its own hash function, but not necessarily a secure one. Talk about getting it backwards. Built-in hash mechanisms generally work fine for short-circuiting equivalence tests and already make a pretty reasonable effort to deal with poor programming. In the end, using a different hash function might only help if there is a lot of entropy in the key that is being lost by the existing one, which tends to only occur with large numbers of reasonably large keys (ironically, String.hashCode() makes some pretty good trade-offs here as compared to a generic hash function). This tends to be an issue that you never encounter, and to the extent that you do, it's just an optimization issue, rather than the logical one the article described. The only major hurdle that is (wisely) left on the table by built-in hash tables is the security one. For tackling that you are best off using an HMAC (designed to address that space of problem) instead of a bare hash. MD5 really isn't the right choice for any case where you have a choice about your hash function, and SHA-1 really isn't any more either, so mentioning them in the article is just silly.
5) Implementations of hashCode() generally work great for in-process hash tables, but are lousy for distributed systems. For the vast majority of hashCode() implementations, if it provides a good in-process partition input, it provides a good out-of-process partition input. Along with its corresponding equals(Object) method, the implementation of Object.hashCode() is, by design, an exceptional case and shouldn't normally get invoked.
6) Form parameters sent to Java web servers are a great example of this problem, and they wisely avoid the DoS problem by limiting the number of keys. Ummm no. First, the form parameters sent to a Java web server are no more a distributed systems problem than any other user-provided data shoved into a lookup table. It's a pure trusted systems problem. The only thing 'distributed' about it is that it tends to involve a network somewhere. In fact, a web server ought to be worried about DoS problems that have nothing to do with hash tables (like the memory footprint of the form data). Given their solution to these DoS exposures and the context, they really ought to just shove the parameters into some kind of efficient tree.
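To make the HMAC suggestion in point 4 concrete, here is a rough sketch of a keyed partitioner built on the JDK's javax.crypto.Mac with HmacSHA256. The class name KeyedPartitioner, the UTF-8 string key, and the choice to take the first four bytes of the tag are all assumptions made for illustration, not something the article or the comment prescribes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class KeyedPartitioner {
    private final SecretKeySpec secret;
    private final int partitions;

    KeyedPartitioner(byte[] secretBytes, int partitions) {
        this.secret = new SecretKeySpec(secretBytes, "HmacSHA256");
        this.partitions = partitions;
    }

    // Maps a key to a partition in [0, partitions). Without the secret,
    // an attacker cannot predict which keys land in the same partition.
    int partitionFor(String key) {
        try {
            // Mac instances are not thread-safe, so create one per call.
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(secret);
            byte[] tag = mac.doFinal(key.getBytes(StandardCharsets.UTF_8));
            int prefix = ByteBuffer.wrap(tag).getInt(); // first 4 bytes of the tag
            return Math.floorMod(prefix, partitions);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable or key rejected", e);
        }
    }
}
```

Every node that shares the secret computes the same partition for the same key, which is the property routing needs; the secret only matters when keys come from untrusted input.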
The contract of Object.hashCode() does not require a stable result between processes; it _only_ requires a stable result during a single run of an application.
So if you need a stable distributed hash, the article recommends using a different method than Object.hashCode(), for instance DistributedObject.stableHash(). Using Object.hashCode() can bite you, since not all implementations will be stable across multiple runs.
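One way such a method could look, as a sketch only: DistributedObject and stableHash() are the names used above, but the interface shape, the UserKey class, and the choice of 32-bit FNV-1a over a fixed byte layout are assumptions made here for illustration.

```java
import java.nio.charset.StandardCharsets;

interface DistributedObject {
    /** Must depend only on state that survives serialization. */
    int stableHash();
}

final class UserKey implements DistributedObject {
    private static final int FNV_OFFSET_BASIS = 0x811c9dc5;
    private static final int FNV_PRIME = 16777619;

    private final String tenant;
    private final long userId;

    UserKey(String tenant, long userId) {
        this.tenant = tenant;
        this.userId = userId;
    }

    @Override
    public int stableHash() {
        // FNV-1a over a fixed layout: UTF-8 tenant bytes, then userId big-endian.
        int hash = FNV_OFFSET_BASIS;
        for (byte b : tenant.getBytes(StandardCharsets.UTF_8)) {
            hash = (hash ^ (b & 0xff)) * FNV_PRIME;
        }
        for (int shift = 56; shift >= 0; shift -= 8) {
            hash = (hash ^ (int) ((userId >>> shift) & 0xff)) * FNV_PRIME;
        }
        return hash;
    }
}
```

Because the result is computed only from the field values and a fixed byte order, two processes holding equal keys get the same number, whatever Object.hashCode() returns for the instances.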
But not e.g. protobuf. So don't write a distributed routing framework that assumes hashCode() is stable - if you do, it will work fine for Strings and HashMaps and all the standard java objects, and then fail when someone tries to use it with protobuf objects.
Your comment qualifies as being "wrong", because of one part: "...it will work fine for Strings and HashMaps and all the standard java objects...". In fact hashCode() is stable for a very small subset of the types that ship with standard Java, and even in some of those cases there are caveats. Arrays come pretty standard with Java, but...
boolean alwaysFalse(int[] a) { return a.hashCode() == a.clone().hashCode(); }
a.clone().equals(a) is false too; you can't use Array as a thing to route on any more than you can use Object. The point is that in standard Java, objects with value semantics have stable hashCode implementations, and this is not true in general.
(And FWIW arrays are a weird corner of Java as far as I'm concerned; every codebase I've worked on avoids them as much as possible)
a.clone().equals(a) HAS to be false in that case. It's more than a bit of a given if a.hashCode() != a.clone().hashCode().
> you can't use Array as a thing to route on any more than you can use Object
Oh, sure you can. You just need to use java.util.Arrays's methods for "equals" and "hashCode" (or potentially deepHashCode).
Also, ArrayList does work.
Yes, that is totally F'd up.
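For reference, a short sketch of the java.util.Arrays workaround and the ArrayList comparison. The IntArrayKey wrapper is a hypothetical helper, not a JDK class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ArrayKeyDemo {
    /** Hypothetical wrapper giving an int[] content-based equals/hashCode. */
    static final class IntArrayKey {
        private final int[] values;

        IntArrayKey(int[] values) {
            this.values = values.clone(); // defensive copy keeps the key immutable
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof IntArrayKey
                    && Arrays.equals(values, ((IntArrayKey) o).values);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(values);
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = a.clone();
        System.out.println(a.equals(b));          // false: arrays compare by identity
        System.out.println(Arrays.equals(a, b));  // true: element-wise comparison
        List<Integer> x = new ArrayList<>(Arrays.asList(1, 2, 3));
        List<Integer> y = new ArrayList<>(Arrays.asList(1, 2, 3));
        System.out.println(x.equals(y) && x.hashCode() == y.hashCode()); // true
    }
}
```

For nested arrays such as int[][], Arrays.deepEquals and Arrays.deepHashCode are the content-based counterparts.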
> The point is that in standard Java, objects with value semantics have stable hashCode implementations, and this is not true in general.
Value semantics are pretty rare in Java. I'm not even sure what you really mean there. Pass-by-value doesn't happen with Java objects. I could see you referring to immutable types, but that doesn't include any of the collection classes. I'm guessing you mean "primitive type wrappers and collections".
It'd make more sense if you said something like, "pretty much anything that overrides Object.equals(Object)"... because that's the way it is supposed to work. Such overrides are rare in the standard class library, because there is little business logic there. In practice, anything that resembles an identifier, and therefore all keys, tends to do the override though. Indeed, most of what people tend to call "business objects" tend to do the override. That's why the convention is there. Most importantly: almost all overrides do so in a fashion that is stable across processes. That's also why distributed frameworks can and should employ the convention/protocol.
The equals/hashCode methods in Object follow the Smalltalk trick of having protocol defined in Object even though you shouldn't really use it without subclassing. The Object method isn't a "reference implementation" of the protocol, but rather a placeholder (one they forgot to override in Array objects, and then tried to backdoor in with java.util.Arrays).
There is no convention that hashCode() should be stable across processes; the only thing that could reasonably define such a convention is the javadoc of Object#hashCode, which explicitly states otherwise. More pragmatically, there is an important set of objects, widely used in distributed systems, whose implementation of hashCode() is not stable across processes (namely protocol buffers objects). So distributed frameworks can't and shouldn't assume hashCode() is stable across processes.
Value semantics are a case where the identity shouldn't really exist at all, as all that matters is equivalence. There is a lot of room there in between where equivalence has meaning but identity can still play a role.
Arguably in Java only the native types (which aren't part of the Java standard library) really embody this, with their wrappers and String barely getting a nod, and Collections totally don't fit the bill. IIRC part of the language definition refers to the fact that all objects have reference semantics.
Protocol Buffers are meant to be used as a memento [http://en.wikipedia.org/wiki/Memento_pattern]. At most you should have a wrapper object providing behaviours like equivalence (in fact with Hadoop, I do this all the time with them, which is kind of a must anyway as Protocol Buffers don't implement Writable, let alone WritableComparable, etc.). Protocol Buffers very much don't have behaviour beyond serialization, which is exactly why they shouldn't have equivalence implementations. Consequently, if you are invoking equals(), hashCode(), etc. on them, "you are doing it wrong".
You're right though that there isn't a strict convention about hashCode() being stable across processes. It's more like a corollary that stems from implementing equals(Object) in terms of an object's equivalence. When you define equivalence for an object, you implement equals(Object) to represent that notion. Because of the contract that hashCode() has with equals(Object), the only cases where hashCode() shouldn't be stable across processes are those where the objects actually aren't equivalent across processes.
This should fit quite well with the objectives of a distributed system. In particular, if you are defining something as a "key", it must have a very clear notion of equivalence for it, and in that context it not only should be stable across processes, it needs to be stable across the entire system. If you don't implement that logic in equals(Object), you're violating that contract. This means your hashCode() method needs to reflect that stability across processes. I can come up with some ways of screwing up hashCode() in that context while still avoiding breaking the contracts, but they're all contrived and in practice I've never seen anyone actually do that.
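A sketch of a key whose equals() and hashCode() are derived purely from its state. It also shows one pitfall, offered here as an illustration rather than a case the comment names: Enum.hashCode() is identity-based, so feeding an enum field straight into the hash would satisfy the in-process contract while varying between JVM runs. RoutingKey, Region, and the field names are hypothetical.

```java
import java.util.Objects;

final class RoutingKey {
    enum Region { EU, US }

    private final Region region;
    private final String customerId;

    RoutingKey(Region region, String customerId) {
        this.region = Objects.requireNonNull(region);
        this.customerId = Objects.requireNonNull(customerId);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof RoutingKey)) return false;
        RoutingKey other = (RoutingKey) o;
        return region == other.region && customerId.equals(other.customerId);
    }

    @Override
    public int hashCode() {
        // Hash the enum constant's name rather than the enum itself:
        // String.hashCode() is specified in its Javadoc, so the result
        // is the same in every JVM, unlike the identity-based Enum.hashCode().
        return 31 * region.name().hashCode() + customerId.hashCode();
    }
}
```

Both methods stay consistent with each other, and the hash depends only on values that any serialized copy of the key would carry.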
Seriously, if you are running into these issues, you have bigger problems in the design.
Object.hashCode() typically returns the memory address of the object, so obviously it won't be the same across processes. This should be common knowledge for all Java programmers. The only way to do what the author wants is to define the serialization rules and the hash function the serialized object is fed to. Again, this should be obvious to every Java programmer.
And isn't that scary? It's not like this is any kind of secret or obscure detail, it's something you would learn in any half-competent Java 101 class or if you bothered to read the docs to the methods you're calling.