I am imagining it. The only genuinely useful feature of this is adding an extra server layer, not removing one.
--
Case 1: Startup X makes a webapp cluster that looks up user information and returns results. It calls a library, which looks up a hash key to query a disk, and returns data.
Problem 1: Lack of load balancing. If there are three disks, and user FRANK is on disk two, and user FRANK's data is getting queried 50x more than the other users, that second disk is toast performance-wise.
Problem 2: No redundancy, plus the short lifespan of disks, means that when a disk dies the user data goes with it.
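To make Problem 1 concrete, here is a minimal sketch of the kind of hash-key lookup described in Case 1. The function names, the choice of SHA-256, and the request mix are all assumptions for illustration, not how any real library does it. The point is only that a static hash sends every request for one hot user to the same disk.

```python
import hashlib
from collections import Counter

NUM_DISKS = 3  # assumption: three disks, as in the example above


def disk_for(user: str, num_disks: int = NUM_DISKS) -> int:
    """Map a user key to a disk index with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_disks


def simulate(requests):
    """Count how many requests land on each disk."""
    load = Counter()
    for user in requests:
        load[disk_for(user)] += 1
    return load


# FRANK is queried 50x more often than each of the other users.
requests = ["FRANK"] * 50 + ["ALICE", "BOB", "CAROL"]
load = simulate(requests)
for disk in range(NUM_DISKS):
    print(f"disk {disk}: {load[disk]} requests")
```

Whichever disk FRANK hashes to takes at least 50 of the 53 requests, and nothing in the lookup path can spread that load without another layer in front of the disks.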
--
Case 2: Big Company Y creates a storage application layer to intelligently do things with the data. They have a small cluster of machines with apps that take queries and do things with the data, and manage the data using key/value pairs on disks attached to a private storage switch.
Problem 1: Depending on Ethernet (with its overhead and latency) for each query doesn't perform as well as other disk interconnects, so you have to use hacks to increase performance. Network management is now a critical component of your storage functionality.
Problem 2: Because the Virtual Memory Manager is no longer managing a filesystem cache, all key/value fields must be cached by the application, so you're re-implementing a VMM layer in your storage application. (Because nobody is stupid enough to not cache random disk queries.)
Problem 3: Relational queries become almost completely useless. Performance drags due to all the individual queries, and you end up building a new cache layer just so your database can speed up searches, or, in the worst case, you end up with an index-only cluster of these disks.
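As a rough illustration of Problem 2 in Case 2, this is the sort of cache the storage application ends up writing once the OS page cache is out of the picture. It is a sketch only: `CachedKVStore`, its `fetch` callback, and the dictionary standing in for a network-attached key/value disk are all hypothetical.

```python
from collections import OrderedDict


class CachedKVStore:
    """Small LRU cache in front of a slow key/value fetch (hypothetical)."""

    def __init__(self, fetch, capacity=1024):
        self._fetch = fetch          # callable that reads a key from the disk
        self._capacity = capacity    # max number of cached entries
        self._cache = OrderedDict()  # oldest entry first, newest last
        self.misses = 0              # number of times we had to hit the disk

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.misses += 1
        value = self._fetch(key)          # the expensive network round trip
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return value


# A dict stands in for the key/value disk in this sketch.
backing = {"user:frank": b"profile-bytes", "user:alice": b"other-bytes"}
store = CachedKVStore(backing.__getitem__, capacity=2)
store.get("user:frank")
store.get("user:frank")  # served from the cache, no second disk read
print(store.misses)
```

Even this toy version has to deal with eviction. A real one also needs invalidation, memory limits, and concurrency, which is the work the VMM and filesystem cache used to do for free.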
--
As you can tell by reading Backblaze's site, there are lots of different uses for storage and different requirements for each. But one thing that's pretty widely acknowledged is that it's more efficient to have a single really long piece of storage than lots of very short pieces. I imagine Backblaze will look at this and go: Why don't we just make our own?