Friday, May 13, 2011

HBase scalability for binary data - I wonder if Cassandra would have the same issue?

One interesting "learning" around HBase is that it's not really a good idea to use it for storing tons of binary data, e.g. photos, map tiles, audio files, etc.
http://www.quora.com/Apache-Hadoop/Is-HBase-appropriate-for-indexed-blob-storage-in-HDFS
http://www.quora.com/Apache-Hadoop/How-would-HBase-compare-to-Facebooks-Haystack-for-photo-storage
...the message and experience here is to store the metadata in HBase, but keep the actual binary data outside it. I'm hearing others echo this lesson.
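To make the pattern concrete, here's a minimal sketch of "metadata in HBase, blobs outside". The two dicts, the `put_photo`/`get_photo` helpers, and the field names are all hypothetical stand-ins, not a real HBase API; in practice the metadata dict would be an HBase table and the blob dict would be a replicated file/blob store.

```python
import hashlib

# Stand-in stores (hypothetical): a real setup would use an HBase table
# for metadata and a replicated external store for the raw bytes.
metadata_store = {}   # row key -> small, fixed-size metadata
blob_store = {}       # blob id -> binary payload

def put_photo(photo_id: str, data: bytes, content_type: str) -> None:
    # Write the binary payload to the external store first...
    blob_id = hashlib.sha1(data).hexdigest()
    blob_store[blob_id] = data
    # ...then record only a small metadata row pointing at it.
    metadata_store[photo_id] = {
        "blob_id": blob_id,
        "size": len(data),
        "content_type": content_type,
    }

def get_photo(photo_id: str) -> bytes:
    # Look up the pointer in "HBase", then fetch the bytes externally.
    meta = metadata_store[photo_id]
    return blob_store[meta["blob_id"]]
```

The point is that HBase only ever sees small rows of predictable size, while the big, rarely-rewritten payloads live in storage built for exactly that.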

There are clearly issues with how partitioning in HBase works and the way it spreads the workload across nodes and rebalances itself. Interestingly, Apache Cassandra ships with two partitioners out of the box: random and ordered. As I understand it, the HBase partitioner is closer to the ordered version, so trying these same use cases on Cassandra with a random partitioner might make for an interesting compare and contrast.
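A toy illustration of why the two partitioners behave so differently under a sequential write load. This is not Cassandra's actual implementation, just a sketch: the node count, the `md5`-based random scheme, and the hand-picked range boundaries are all assumptions for illustration.

```python
import hashlib
from collections import Counter

NODES = 4

def random_partition(key: str) -> int:
    # "Random" partitioner: hash the key, so adjacent keys scatter
    # roughly evenly across nodes.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

def ordered_partition(key: str, boundaries: list[str]) -> int:
    # "Ordered" partitioner: keys are placed by sort order against
    # fixed range boundaries, so a sequential key pattern can land
    # entirely on one node.
    for node, upper in enumerate(boundaries):
        if key <= upper:
            return node
    return len(boundaries)

# Sequential row keys, e.g. timestamped tile or photo ids (hypothetical).
keys = [f"tile-2011-05-13-{i:06d}" for i in range(1000)]
boundaries = ["g", "n", "t"]  # splits chosen before this load arrived

random_load = Counter(random_partition(k) for k in keys)
ordered_load = Counter(ordered_partition(k, boundaries) for k in keys)
# random_load spreads across all four nodes; ordered_load piles every
# key onto the one node owning the range above "t" (all keys start
# with "tile-"), i.e. a hot spot.
```

Ordered placement buys you efficient range scans at the price of hot spots like this; random placement gives even load but gives up ordered scans, which is exactly the trade-off worth testing with binary-heavy workloads.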

My working assumption is that it's also safer to store binary data outside Cassandra if you want constant, predictable response times, and to instead rely on highly available (i.e. replicated) storage that is really good at write-light, read-heavy binary data.

I'm interested to hear from anyone using Apache Cassandra who is storing large amounts of binary data (upwards of 10s of TBs).

S.